Stephen Holiday

Spaced Repetition: My Learning Secret

2014-09-08T00:00:00-07:00

There are people who can study the night before a test and ace it. I am not one of those people.

Although I loved school, I was really bad at studying. In fact, I hated studying. I studied by reading my notes and the textbook, trying to make sure I understood the “fundamentals” and “core concepts”. Of course, I did this in the days or hours leading up to a test. Even when I started to study earlier, I found I needed to go back over content I already thought I had “learned”. Studying was boring, discouraging, and ineffective.

During university I stumbled upon a technique that allows me to learn and retain information extremely well — spaced repetition.

Spaced Repetition

The gist is you make a set of flashcards using software and practice them every day. Every time you review a card, you can tell the program if you knew the answer and how easily you recalled it.

Now, that’s easier said than done. This technique is not a shortcut to learning by any measure. Learning using spaced repetition takes a lot of time and effort. You must review your cards every day.

We all know that over time our memory for facts decays. It turns out that if we cause our mind to recall a fact right before we are about to forget it, we will strengthen the memory. In more concrete terms, if we learn a fact and then recall it at increasing intervals of, say, 10 minutes, 1 day, 3 days, 5 days and so on, we will be able to remember it for much longer. For a better overview of the science, check out this Wikipedia article.
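To make the idea of growing intervals concrete, here is a minimal sketch in Python. It assumes a fixed ladder of waits taken from the example figures above. The names `INTERVALS`, `next_step`, and `simulate` are made up for illustration, and real spaced repetition software computes intervals per card rather than stepping through a fixed list.

```python
from datetime import timedelta

# Illustrative ladder of waits, taken from the example figures in the text.
# This is an assumption for the sketch, not how any particular program schedules cards.
INTERVALS = [
    timedelta(minutes=10),
    timedelta(days=1),
    timedelta(days=3),
    timedelta(days=5),
]


def next_step(step, recalled):
    """Return the index of the interval to wait after one review."""
    if not recalled:
        return 0  # forgotten: start over at the shortest interval
    return min(step + 1, len(INTERVALS) - 1)  # remembered: wait longer next time


def simulate(reviews):
    """Given a list of recall outcomes, return the wait chosen after each review."""
    step = 0
    waits = []
    for recalled in reviews:
        step = next_step(step, recalled)
        waits.append(INTERVALS[step])
    return waits
```

Calling `simulate([True, True, False, True])` shows the wait growing after each success and dropping back to the shortest interval after the one failure.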

Spaced Repetition Software manages this for us. Here’s an example of one of my cards from a Computer Architecture course:

Note the time on this card is a little extreme because I haven’t reviewed this card in over a year. You can see that I provide a lot of context on my cards. I also included an excerpt of the lecture slide. I’ll talk more about constructing good cards in a bit.

Learning or Memorizing

When I’ve told friends about the technique, they are skeptical that I actually learn anything and posit that I am just blindly memorizing. I too initially thought flashcards were only for pure memorization.

I originally tried spaced repetition because a lot of concept-heavy courses still had facts or rules that I frequently forgot. During an exam I would know what the question was asking, how to answer the question and even most of the facts or rules. However, most is not all.

As I used spaced repetition more often, I found myself adding cards to “memorize” the relationship between concepts or ideas. For example, I had a card to differentiate between conflict cache misses (a miss caused because the cache associativity is too low) and coherence misses (a miss caused because some external force invalidated a cache line, as in an SMP).

At first glance this appears like I’m memorizing the differences between the two concepts. In some sense I am, but if I can correctly contrast between the two misses have I not learned the concept? Even more interesting is that with this knowledge I can actually understand the implications in the real world.

After a while, I learned to create flashcards that required really understanding the material if I was to answer the questions correctly. This is when I started to see the benefit. Trying to memorize a card without understanding it can work for the short term, but after longer intervals I found recalling facts extremely difficult.

Creating Cards

I use Anki for my spaced repetition. It’s free for desktop and can sync with your mobile devices.

I skip class and instead sit in front of my large monitor with the lecture slides on one side of my screen and Anki’s card editor on the other. I read the slide (and surrounding slides) to determine what the concepts are. I create cards that cover these concepts from multiple angles. When in doubt I add more concepts rather than fewer.

The authors of another spaced repetition program, SuperMemo, have an excellent set of rules for creating cards called “The 20 rules of formulating knowledge in learning.” I’ll outline the rules I found particularly helpful here.

4: Stick to the minimum information principle

It’s easier to learn small and concise pieces of information.

5: Cloze deletion is awesome

Almost all of my cards use cloze deletion. It’s much more effective than a standard flashcard with a front and back. You can use cloze deletion for multiple parts of a concept. This allows you to recall the concept from different sides.
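As a rough illustration of how one note with several deletions turns into several cards, here is a small Python sketch. The `{{c1::...}}` markers follow Anki's cloze notation. The `make_cards` function and the sample sentence are hypothetical and are not Anki's own code.

```python
import re

# Matches markers such as {{c1::conflict}} and captures the number and the hidden text.
CLOZE = re.compile(r"\{\{c(\d+)::(.*?)\}\}")


def make_cards(note):
    """Return one (question, answer) pair per cloze number found in the note."""
    numbers = sorted({int(number) for number, _ in CLOZE.findall(note)})
    answer = CLOZE.sub(lambda match: match.group(2), note)
    cards = []
    for target in numbers:
        def hide(match, target=target):
            # Blank out only the deletion being tested; show every other one.
            return "[...]" if int(match.group(1)) == target else match.group(2)

        cards.append((CLOZE.sub(hide, note), answer))
    return cards


note = "A {{c1::conflict}} miss happens when cache {{c2::associativity}} is too low."
cards = make_cards(note)
```

Each card hides exactly one part of the sentence, so the same fact gets recalled from more than one direction.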

6 and 8: Use Images (and image cloze deletion)

Here’s an example of Tomasulo’s CPU register renaming technique with a visual cloze deletion. Notice how the only item removed is the register status. That’s the level of detail that I found useful. I use the Anki plugin Image Occlusion to create image cloze deletions quickly.

9 and 10: Avoid Sets and Enumerations

Remembering sets and enumerations (ordered lists) is unnecessarily difficult. You can often convert these into more meaningful and useful cards. Usually I do this by contrasting and comparing concepts as in the inclusions versus exclusion cards earlier.

If you can’t find a better way to represent the concepts, cloze deletion on a list can really help out. This card describes what happens during a page fault. Every time I review the card I see all but one step, allowing me to think about the step in the context of the other steps.

11 and 16: Combat Interference

Interference is when two cards are similar and you confuse the concept and therefore the answer. Interference is extremely frustrating, especially if the two cards are separated by time.

Consider two similar concepts far apart from each other in the deck. You may learn the first concept perfectly. Later the second concept is hard to learn but you eventually figure it out. As time goes on the first concept comes up again and you’ve forgotten about it.

One of the ways I deal with interference is by providing lots of context in the card to differentiate concepts as in the cache hierarchy example. I sometimes go back to cards and modify them to be dissimilar.

12: Omit needless words

If your cards contain a lot of unnecessary words, your brain takes longer to make the connection to the concept. I tend to write verbosely so I actively try to limit the amount of wording on a card, eschewing grammar.

17: Redundancy is OK

I often have several cards that cover the same concepts from different angles, sometimes using cloze deletions, images, or comparisons to similar concepts. These can all help strengthen memories. Approaching a concept from different angles also ensures you can recall the concepts in different situations.

A Topological Sort of Knowledge

A theory on learning I once heard was that in order to learn a concept you need to have a hook to hang it on. In other words, you need to see where an idea fits in the bigger picture before you can internalize it.

One way to do this for lectures is to pre-read lecture slides. That way when the professor is presenting the information, you’ll already know where it fits. I did this for my Data Structures and Algorithms class before I used flashcards and it was a boon for my understanding and my exam mark.

Ideally, we would be presented with information in the perfect order for learning, a topological sort of knowledge so to speak. However, it’s difficult (perhaps impossible) to construct a perfect ordering. Spaced repetition allows us to approximate this idea.
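For readers who have not met the term, here is a small Python sketch of what a topological sort produces: an ordering in which every concept comes after the concepts it depends on. The prerequisite map below is invented purely for illustration, and `graphlib` needs Python 3.9 or newer.

```python
from graphlib import TopologicalSorter

# Hypothetical prerequisite map: each concept lists what must be understood first.
prerequisites = {
    "cache basics": set(),
    "cache associativity": {"cache basics"},
    "conflict misses": {"cache associativity"},
    "coherence misses": {"cache basics"},
}


def study_order(prereqs):
    """Return the concepts in an order that respects every prerequisite."""
    return list(TopologicalSorter(prereqs).static_order())


order = study_order(prerequisites)
```

Several valid orders can exist for the same map, which hints at why a single perfect ordering of real course material is so hard to pin down.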

When you are reviewing cards, they will not necessarily be in the order the lecturer presented them. As you review cards that contain a concept and the surrounding concepts, you will be exposed to the concept in the context of the bigger picture through repeated interleaving. That is, in between cards on a particular concept the software will be showing you other cards that are often about related concepts.

As you see this interleaving, you’ll start to understand the similarities and differences between concepts. Once you can do this, you’ll be able to answer the cards quickly and with great accuracy.

Reviewing Cards

I review my cards every day. If you don’t they will build up. It takes me about an hour every day and I often split it up into two 30 minute sprints. Sometimes I find it convenient to review cards on my iPhone with the mobile app. Especially on a long bus ride. The desktop app is fine, but I use a plugin called Full Screen to maximize the window on my Mac to reduce distractions. Recently, I’ve started to stand while reviewing to limit my wandering mind.

Tony, while proofreading this post, noted that he thinks I should attribute the better grades I got with spaced repetition to just studying more. He is probably correct. Spaced repetition allows me to study effectively (not necessarily efficiently). I do spend a huge amount of time creating cards and reviewing them, but the value I gain is immense.

Test Time

A great way to study for the kinds of exams I took in engineering is to practice problem sets or old exams. I find problems similar to or in the same style as those that will be on the exam. Before spaced repetition, each problem with a new concept required a haphazard search of lecture slides and my notes. Even once I had understood the concept and successfully completed the problem, it was difficult to redo a similar problem later on during the exam study period. I couldn’t recall the specifics of the concept.

I found that because I had studied the concepts in the months leading up to the exam, the time I spent on practice problems was used efficiently. When I did a problem, I had already learned the concept and its relationship with other concepts. Once I had completed many of the practice problems, I would review the entire deck of cards for the course using the Filtered Decks feature. Filtered Decks allowed me to refresh my memory on the entire course and to pinpoint areas that needed concentration.

Conclusion

I urge students to consider alternative studying methods. It can be surprising what works for you. When you find a technique that works well for you it can change your whole attitude to exams and studying. The most valuable lesson I learned in university was how to learn.

Acknowledgments

First I’d like to thank Derek Sivers of CD Baby fame for introducing me to spaced repetition through this post. Second, the authors of SuperMemo for their excellent list on formulating knowledge. Third, my editor Tony Dong. Last, thanks to Damien Elmes for creating and maintaining Anki, the spaced repetition software I rely on.

That First Co-op Job

2014-08-07T00:00:00-07:00

My first term in Computer Engineering was hectic. Within a few weeks we were already applying to co-op positions for January, passing ourselves off as Waterloo Engineering Students even though we were still getting lost on campus. OK, maybe that was just me. The MC building is really confusing!

Every so often I receive an email or Facebook message from a Waterloo student with some great questions about finding their first job. I’ve culled through my responses and put together this collection of things I wish I had known about when I was first applying for jobs.

Other people you talk to may completely disagree with what I have to say, and that’s OK. The advice is based on the experiences of my friends and me. You should try to ask as many upper year students as possible!

The Process

I think it’s helpful to understand how the job application process works at Waterloo for engineers. There are some details on the Co-op website but I’ll give an overview here.

JobMine is where you will apply for jobs. There will be a list of jobs that you filter down. Some 800-odd listings will match the broad filters of Computer, Electrical, and Software Engineer.

When you apply an employer will see a package that contains your co-op history (at this point empty), your transcript (embarrassingly short) and then your resume. You can upload tailored resumes or resumes with cover letters for a given position.

After a while, you’ll start to see that you’ve been rejected for jobs (don’t freak out!). Hopefully you’ll eventually get the word that you’ve been selected for an interview! See more about the interview process here.

Once the interview round is complete you’ll go through the ranking process described here.

Your Resume

Building your resume is one of the biggest factors that can get you an interview for your first job. Later on it can be your marks and list of previous co-op experiences the school prepends to your resume. Spend a ton of time on your resume and have your classmates look at it.

I think you should be able to get it to one page. Employers get bored. Especially when they are hiring for junior jobs. Your skills summary can be a great way to show off your key qualifications. If you’ve been awarded a rare scholarship, this is a great place to highlight it. But keep the skills section short and easy to read.

Some employers search your resume for keywords (Java, AutoCAD, VHDL, etc.). Some do it visually and others with a tool. If you list a skill on your resume it is good to have a corresponding job or project to back up the claim.

Keep in mind that in the real world, resumes are a little different. You should look for other CompEng resumes for ideas of how to write yours. Remember, on JobMine everyone is from Waterloo. You don’t need to put your education first. They will prepend your transcript to your resume anyway.

I also leave out my address. If someone needs to mail me something they’ll ask. Usually it’s once you have the job anyway so it will not matter (some people don’t like that though).

Kai Umezawa, in Systems Engineering told me that you should make your resume stand out. Everyone’s will look the same, so try to make yours eye catching. Don’t try anything too fancy or odd though, that might freak some people out. Look at other resumes for inspiration.

Cleaning It Up

Go to EngSoc’s resume critiques. I went to one every term for the first two years. They are incredibly helpful. I’ve volunteered for them as well and I’m also happy to look at your resume and give some feedback. You want a bunch of people to look over your resume.

Upper year students know what employers want; the university does not. Some of these upper year students have even hired the next co-op student.

Waterloo’s Centre For Career Action and EngSoc have published a resume tips package and while I don’t really agree with some of it, their Action Word List is awesome!

While I’m editing my resume I print it out and carry it with me. When I have free time I look at it, making notes and changes. I show copies to friends and ask what looks off to them. Many employers screen candidates by printing out the resumes, so make sure it looks good on paper in black and white!

Experience

While a lot of people do not have “traditional” experience, if you think hard about the stuff you’ve done, you probably have some experience with relevant skills. Maybe you had a lot of responsibilities when you were a volunteer, played team sports or managed a bunch of people.

Many people make the mistake of not applying for jobs because they don’t have the requirements. This is really bad. Employers are not expecting you to have everything. This list of requirements is their dream candidate.

Often the person writing the requirements is not even the same person who makes the hiring decision. My first job (a PHP job) had Ruby listed as a skill. I asked at work when I got there and they laughed and said no one had ever used Ruby for work there.

Projects

It was definitely my personal projects that got me hired the first time. Think about it: every applicant is going to be practically the same. They all had top marks in high school, won a scholarship of some kind, and are good with computers. You need to stress your personal projects. I put them before my education on my resume.

If you don’t have much work experience in the field, personal projects are perfect. Employers love to see you like working on this stuff in your free time. It shows you enjoy the work and can work on your own. Even when I was interviewing for fulltime jobs, interviewers still asked a lot about my personal projects.

Other Resume Thoughts

You’d think everyone would use email but you need to make sure your phone number works and that it has voicemail. So many prospective employers have left messages on my voicemail. I have received job offers on my voicemail. You need voicemail.

Use a @gmail address. Hotmail looks bad. Some opt for @uwaterloo but that used to be unreliable (on engmail at least). While I could have used @stephenholiday.com I was too afraid of it not working correctly. Also, drag0n_sl4yer93@gmail.com isn’t scaring anyone into hiring you either.

For one of your courses you will have to hand in a resume. Do not hand in the one you are using for co-op. You will fail that assignment. Like I said, the University wants something different.

JobMine Strategy

Some people are afraid to apply to intermediate and senior jobs, but don’t worry about that. You’ll get to pick up to 50 listings to apply to. I used all 50 and I’d recommend doing the same for the first job. You should know, though, that you can’t say no to a job if it’s the only offer you get (I heard this might be changing). Later on you’ll want to dial it down once you have more experience on your resume. Otherwise you’ll have too many interviews. Doesn’t that just sound great?

There are two posting weekends where you get to apply for jobs. Once the first deadline is over, employers who didn’t get their job descriptions in fast enough will be in the second round. So you should leave some of the 50 for this. Once you get rejected from a couple jobs from the first posting you can apply to more jobs for the second.

The saving grace for many is the continuous round. It’s where all the jobs are that didn’t get filled in the first two rounds. It’s more random but is designed to find you a job. There’s some great stuff in there.

Interviews

Interview Questions

I prepare a lot for interviews. I spend more time on that than on any single class. “Cracking the Coding Interview” is a great resource for a lot of the questions you may be asked. Practice with friends; it’s very helpful. Derek says “Elements of Programming Interviews” is also an awesome resource.

My friends and I will ask each other questions from our interviews to add even more material to practice. There are a lot of sample questions online.

Earnings

An aside about earnings: you can see what the current going rate is here.

At the end of the interview, always ask about the pay. It’s good if you ask a bunch of questions first. Some people are afraid to ask the pay. Some employers are thrown off if you don’t ask the pay (aren’t you interested in the job?).

If you feel uncomfortable about asking, you can frame the question something like “do you know what the compensation is for this position?” Sometimes they actually have no clue, it’s a little weird but not uncommon in large organizations.

If an interviewer tells me “it’s competitive” I cite the salary survey and talk about how big that range is considering I’m paying for tuition. It is hard for an employer to determine a fair salary if this is their first time hiring a student.

“Stop Me If You Think You’ve Heard This One Before”

If you do coding questions and you have been practicing, you’ll often be asked a question you’ve heard before or is similar to another one you’ve done. Some believe you should tell the interviewer right away because they can tell. In my experience, I don’t and they can’t. However, if an interviewer asks me at any point if I’ve heard the question, I tell the truth.

My feeling is that if they are just asking questions they found online then I don’t feel that bad. As a potential future interviewer, I’d ask you to tell me but maybe I deserve it for asking a common question. However, just because you know the answer doesn’t mean you can explain it or code it up in an interview.

It’s not just about the code!

First impressions matter disproportionately. When you enter, be friendly, polite and upbeat.

Smile, even when you are on the phone. It may sound silly but when I smile on the phone my attitude and language are a lot more cheerful.

For your first job, they might not ask many technical questions. Don’t be discouraged. Waterloo’s Centre For Career Action has some thoughts on how to answer behavioral questions. Practice with your friends!

Practice common behavioral questions. For example, come up with three strengths and three weaknesses ahead of time. It’s really awkward when you list off your strengths and the interviewer says, “that was two.” On the bright side, you only need to list two weaknesses now…

A note about weaknesses: don’t do that “I work too hard” or silly stuff like that. I try to pick things that are not about my character but something I can learn. Then I tell them what I’m doing to change.

“It’s not about me, it’s about you.”

A friend told me the trick to being likeable is to repeat in your mind “It’s not about me it’s about you.” People love to talk about themselves and their interests. People like to hire people like themselves. They want to find someone they think they can get along with.

I find it very helpful to ask your interviewer about what they do and what they like about their work, even if it isn’t what you will be doing. Not only does it give you insight into how they work but it also allows the interviewer to talk about themselves.

That’s why they call it work!

In the interview, you should not give the impression that you feel some tasks are below you (even if you secretly do). You want to get stuff done and you know it’s not all awesome. Sometimes the organization just needs stuff to get done.

That being said, I’ve loved most of my jobs and I often forget that someone is paying me to do the stuff I love.

You might not get a development job (if that’s your goal) for your first co-op job. And that’s OK. You have 6 co-op terms to grow!

Summary

Work hard on your resume
Apply to a lot of jobs
Practice interview questions

In case you couldn’t tell, I’m pretty passionate about this stuff. I’m happy to answer more questions or look over your resume. Just shoot me an email.

Congrats on choosing an excellent program!

Acknowledgments

First I want to thank Tony Dong and Vineet Nayak for proofreading this. Thanks also to Derek Thurn for suggesting the “Elements of Programming Interviews” and Adhiran Thirmal for some thoughts about skills summaries and keywords.

I’m forever indebted to the amazing upper year students who gave me career advice and helped me turn my bulleted list of randomness into a resume. Particularly Mehdi Mulani, Adam Flynn and those volunteers from the EngSoc resume critiques.

Thanks to my roommates over the years (particularly Parth Gajaria, Vineet Nayak, Tony Dong, and Andy Wu) for proofreading my resume, practicing interview questions with me and slowing down when I didn’t “get it”.

Finally, thanks to the people who run JobMine and the Centre For Career Action. Some may complain about the $500 or so fee in our tuition every term but I felt like I was getting a deal so good it seemed like theft.

My Backup Strategy

2014-07-13T00:00:00-07:00

I remember the first time I had a hard drive fail on me. I was pretty upset. It was a 6.8 GB drive that I scavenged in a garage sale like all of my Frankenstein’s monster computers. I had lost many documents and pictures before with faulty floppy discs and accidental deletions, but never before had I lost so much in one fell swoop. I was pretty young at the time so the documents and Photoshopped pictures were far from masterpieces, but nevertheless they mattered to me.

Had I known about SpinRite at that time I may have been able to resuscitate the drive and save a few files. SpinRite has saved many files for friends and family of mine who did not have a backup plan. It’s awesome and you should try it.

After that day I vowed to never lose a file again. It took me many years to set up a backup system that works for me and I learned a few things in the process. It has saved my Canadian bacon more times than I’d care to admit.

Theory

I’m a big fan of Scott Hanselman’s “3-2-1” backup strategy:

3 copies
2 different formats
1 copy off-site

Three copies may be a little extreme for some people, and that is fine, but when both my main machine and my main backup were acting up at the same time I was pretty happy to have another copy.

The second point regarding formats is an interesting one. My mother backs up all of our home movies to an external hard drive as well as DVDs she keeps off-site. She used to just have DVDs. However, optical media (especially if it’s from the same manufacturer) will tend to degrade in the same way at a similar rate. While it’s helpful to have multiple copies in the case of scratches (though if you are backing up you probably aren’t leaving the media lying around), it doesn’t help if they both fail in the same way.

If your backup is beside your computer when your house burns down, or when a burglar steals both your laptop and external hard drive, it doesn’t really help.

Two more things I would add as a requirement for a good strategy:

Automatic
Version History

If my strategy was not mostly automatic, I’d have abandoned it years ago. At least one of your backups should be automatic to ensure it happens regularly. One of the benefits of automatic backup is you can often step up the frequency. If you only run your backup every Sunday and your drive fails on Saturday, you’ll be out a lot of work.

I also believe having a version history of your files is very important. I remember doing tedious data entry for a day and then accidentally saving over the file at the end of the day. I didn’t notice until the next day that my data was gone. It would have been extremely frustrating to throw away all that work.

If my backup system had been running hourly and simply overwrote the file with the latest version, I would have lost all that work. However my backup system kept hourly versions for a week and weekly versions for a year.

Stephen’s Strategy

I use Apple’s Time Machine on two external drives plus CrashPlan.

Time Machine is simple and awesome. Whenever a drive is plugged in, OS X will automatically back up (with versions) all the data you ask it to. I also keep my backup drives (as I do with all my drives) encrypted with Apple’s FileVault. Apple does a great job of making full disk encryption really easy. I have two backup drives: one that stays in my apartment beside my laptop and another at my parents’ house. Whenever I go home to visit I run a backup from my laptop. Time Machine also keeps a revision history of my files:

Hourly backups for 24 hours
Daily backups for the past month
Weekly backups for the previous months
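The thinning schedule in the list above can be sketched in a few lines of Python. This is only an illustration of the idea and not how Time Machine is implemented. The function name `keep_backups` is made up, and treating "the past month" as 30 days is an assumption.

```python
from datetime import datetime, timedelta


def keep_backups(timestamps, now):
    """Pick which snapshots survive: hourly for a day, daily for a month, then weekly."""
    kept = {}
    for ts in sorted(timestamps):
        age = now - ts
        if age <= timedelta(hours=24):
            bucket = ("hour", ts.strftime("%Y-%m-%d %H"))
        elif age <= timedelta(days=30):  # assumption: "the past month" means 30 days
            bucket = ("day", ts.strftime("%Y-%m-%d"))
        else:
            year, week, _ = ts.isocalendar()
            bucket = ("week", year, week)
        # Timestamps are visited oldest first, so the newest snapshot in each bucket wins.
        kept[bucket] = ts
    return sorted(kept.values())


now = datetime(2014, 7, 13, 12, 0)
snapshots = [
    datetime(2014, 5, 5, 10, 0),
    datetime(2014, 5, 7, 10, 0),
    datetime(2014, 7, 10, 9, 0),
    datetime(2014, 7, 10, 18, 0),
    datetime(2014, 7, 13, 11, 5),
    datetime(2014, 7, 13, 11, 35),
]
survivors = keep_backups(snapshots, now)
```

Older history gets coarser instead of disappearing, which is what keeps the storage needed for a long version history manageable.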

But the best part of my strategy is CrashPlan. CrashPlan is an online backup service that I pay for. It is off-site and it has great encryption capabilities. I use my own encryption key with it so that the people at CrashPlan cannot even look at my files. It also backs up every five minutes!

CrashPlan is great when I’m working on my laptop while on the go away from my external hard drive. When I was in Tokyo I had one of the external drives with me but CrashPlan allowed me to keep an off-site backup of my photos in the case that my laptop and backup were destroyed or lost on the way back to Canada.

CrashPlan also keeps every five minute version forever under the settings I’ve chosen. This is probably unnecessary, but it doesn’t cost me anything extra.

Bottom Line

Have multiple copies, some off-site and make it automatic!

How Project Rhino leverages Hadoop to build a graph of the music world

2014-03-27T00:00:00-07:00

This last week Tony Dong, Omid Mortazavi, Vineet Nayak and I presented Project Rhino at this year’s ECE Capstone Design Symposium at the University of Waterloo.

Project Rhino is a music search engine that allows you to ask questions about the music world in plain English. Check out a demo video here.

We noticed there is a ton of data available about the music world but that it was disconnected. There was no easy way to explore the relationships in the music world.

This project allows a user to express an English query, that is then transformed into a traversal over the music data we have collected. All of the data was retrieved from freely available Creative Commons sources. However, integrating the data from disparate sources is non-trivial.

Here are some of the questions Project Rhino can answer:

Find artists similar to "The Beatles" and played in "Waterloo, Ontario"
Find artists from Canada and similar to "Vampire Weekend"
Find songs by artists from Australia and similar to artists from Germany
Show venues where artists born in 1970 and from Japan played

Aside from the complexities of data integration, the sheer volume of data (100+ GB of compressed JSON) makes this project challenging.

In this post I focus on one aspect of the ETL pipeline I developed for Project Rhino, the construction and insertion of the graph into our graph database.

The Graph Database

We used Titan on top of Cassandra to store our graph. It’s a pretty cool project and worth checking out. They provide a nice graph abstraction and excellent graph traversal language called Gremlin.

When we first came up with the idea for this project at Morty’s Pub, we were going to build our own distributed graph database from scratch. Titan saved us a lot of pain.

The people who produce Titan have a batch graph engine built on top of Hadoop called Faunus. It’s pretty cool but we didn’t end up using it for some of the reasons I’ll talk about later.

Rhino’s Batch Graph Build

The final and most intensive stage of the pipeline is graph construction. This is when the data is combined to create the graph. The graph is made up of two components: nodes with properties (ex. an artist) and edges between nodes with properties (ex. writtenBy). This stage outputs a list of graph nodes and then a list of graph edges. This process is managed by a series of Hadoop MapReduce jobs.

Hadoop Vertex Format

The first step is to transform the intermediate form tables into nodes and edges. Since Hadoop requires that data be serialized between steps, I needed a way to represent the graph on disk. I chose to use a Thrift Struct.

Here’s the Thrift definition for a vertex in the ETL pipeline:

struct TVertex {
  1: optional i64 rhinoId,
  2: optional i64 titanId,
  3: optional map<string, Item> properties,
  4: optional list<TEdge> outEdges
}

The rhinoId is the ID used in the intermediate tables. This is the ID we assigned as part of the ETL process. titanId refers to the identifier Titan generates for the vertex. The distinction is discussed in further detail in the Design Decisions section below.

The outEdges field is a list of graph edges with the current vertex as its source (or left hand of the arrow).

The properties field is a map of string keys to typed value as described here:

union Item {
  1: i16 short_value;
  2: i32 int_value;
  3: i64 long_value;
  4: double double_value;
  5: string string_value;
  6: binary bytes_value;
}

Here is the Thrift definition for an edge in the ETL pipeline.

struct TEdge {
  1: optional i64 leftRhinoId,
  2: optional i64 leftTitanId,
  3: optional i64 rightRhinoId,
  4: optional i64 rightTitanId,
  5: optional string label,
  6: optional map<string, Item> properties
}
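To show how these records fit together, here is a rough Python mirror of the two structs using dataclasses. It is a sketch for reading along and not the generated Thrift code. The `Item` union is simplified to a plain Python value, and the sample song and IDs are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union

# Simplified stand-in for the Thrift Item union.
Item = Union[int, float, str, bytes]


@dataclass
class TEdge:
    leftRhinoId: Optional[int] = None
    leftTitanId: Optional[int] = None
    rightRhinoId: Optional[int] = None
    rightTitanId: Optional[int] = None
    label: Optional[str] = None
    properties: Dict[str, Item] = field(default_factory=dict)


@dataclass
class TVertex:
    rhinoId: Optional[int] = None
    titanId: Optional[int] = None
    properties: Dict[str, Item] = field(default_factory=dict)
    outEdges: List[TEdge] = field(default_factory=list)


# Hypothetical data: a song vertex with one outgoing writtenBy edge to vertex 2.
song = TVertex(
    rhinoId=1,
    properties={"name": "Example Song"},
    outEdges=[TEdge(leftRhinoId=1, rightRhinoId=2, label="writtenBy")],
)
```

Every field is optional because the Titan IDs only become known later in the pipeline, so early records carry just the Rhino IDs.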

MapReduce Jobs

Vertex Jobs

The first set of jobs is responsible for converting the intermediate tables into Thrift. Here is an example of the conversion of two tables into their Thrift form.

Now all of the tables follow the same structure and we can treat all vertices the same way!

Edge Jobs

The next set of jobs transforms intermediate edge tables into edges. In practice, vertex and edge conversion jobs can be (and are) run simultaneously.

Note that instead of producing TEdge structures, TVertex structures are produced. To facilitate bulk loading, edges are treated like vertices that have no properties. This allows the next stage to treat records from vertex jobs and edge jobs identically.
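As a rough illustration of the edge-as-vertex trick, here is a Python sketch that mirrors the two Thrift structs with plain dataclasses. The class names Vertex and Edge and the helper edge_row_to_vertex are assumptions made up for this example; the real jobs emit Thrift TVertex records.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Edge:
    # Mirrors TEdge: Titan IDs stay unset until the insert jobs run.
    left_rhino_id: int
    right_rhino_id: int
    label: str
    left_titan_id: Optional[int] = None
    right_titan_id: Optional[int] = None
    properties: Dict[str, object] = field(default_factory=dict)


@dataclass
class Vertex:
    # Mirrors TVertex.
    rhino_id: int
    titan_id: Optional[int] = None
    properties: Dict[str, object] = field(default_factory=dict)
    out_edges: List[Edge] = field(default_factory=list)


def edge_row_to_vertex(left_rhino_id, right_rhino_id, label):
    """Wrap one edge row in a property-less vertex keyed by its source."""
    edge = Edge(left_rhino_id, right_rhino_id, label)
    return Vertex(rhino_id=left_rhino_id, out_edges=[edge])


record = edge_row_to_vertex(1, 2, "writtenBy")
```

Because the edge record has the same shape as a vertex record, a later stage can read both kinds of input without caring which job produced them.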

Graph Combine Job

The next stage is a single job that combines vertices and edges by joining them on titanVertexId. As pictured, both vertex and edge records are combined together so that each vertex appears only once in the output.
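The combine step can be pictured as a group-by on the vertex ID followed by a merge. The sketch below does that in memory with plain dictionaries; the record layout and the generic "id" key are assumptions for illustration, and in the real pipeline Hadoop's shuffle does the grouping.

```python
from collections import defaultdict


def combine(records):
    """Merge vertex and edge records that share an ID into one vertex each."""
    merged = defaultdict(lambda: {"properties": {}, "out_edges": []})
    for record in records:
        vertex = merged[record["id"]]
        vertex["properties"].update(record.get("properties", {}))
        vertex["out_edges"].extend(record.get("out_edges", []))
    return dict(merged)


records = [
    {"id": 1, "properties": {"name": "Some Song"}},   # from a vertex job
    {"id": 1, "out_edges": [("writtenBy", 2)]},       # from an edge job
    {"id": 2, "properties": {"name": "Some Artist"}},
]
graph = combine(records)
```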

Vertex Insert Job

The Vertex Insert Job is the first job that actually inserts data into the graph. There are two important phases of this job: the mapper and the reducer. The mapper, as shown below, writes each vertex to Titan.

When it writes the vertex to Titan, it receives an opaque ID that Titan has assigned to the vertex. The mapping between rhinoId and titanId is written out so that it can be matched with all possible incoming edges in the reduce phase.

Next, all of the outgoing edges are written out, grouped by target instead of source vertex. The key is the rhinoId of the target (the right-hand vertex), and the value now includes the titanId of the source (the left-hand vertex).

In the reduce phase, all of the records with the same rhinoId (including the record showing the mapping to Titan ID) are processed at the same time. The titanId for the given Rhino ID will always appear first due to the use of a custom sort function not described here.

For each subsequent record (now all edges), the rightTitanId is added to the edge and written out. At this point the edges could be written directly to Titan. I’ll talk about why we don’t later on.
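To make the two phases concrete, here is a small in-memory simulation in Python. Everything in it is a stand-in: the counter plays the role of Titan's ID generator, the numeric sort tag plays the role of the custom sort function, and the tuple layout is invented for the example. It also assumes every edge target has been inserted as a vertex.

```python
from itertools import count, groupby

_titan_ids = count(1000)  # stand-in for Titan's opaque ID generator


def map_vertex(vertex):
    """Insert one vertex, then emit records keyed by rhino ID."""
    titan_id = next(_titan_ids)  # pretend this is the write to Titan
    # Sort tag 0 makes the ID mapping sort ahead of the edges (tag 1).
    yield (vertex["rhino_id"], 0, titan_id)
    for label, target_rhino_id in vertex["out_edges"]:
        # Re-key each edge by its target; carry the source's Titan ID.
        yield (target_rhino_id, 1, (titan_id, label))


def reduce_all(records):
    """Fill in the target Titan ID on every edge."""
    edges = []
    ordered = sorted(records, key=lambda r: r[:2])
    for _, group in groupby(ordered, key=lambda r: r[0]):
        group = list(group)
        right_titan_id = group[0][2]  # the mapping record comes first
        for _, _, (left_titan_id, label) in group[1:]:
            edges.append((left_titan_id, label, right_titan_id))
    return edges


vertices = [
    {"rhino_id": 1, "out_edges": [("writtenBy", 2)]},
    {"rhino_id": 2, "out_edges": []},
]
records = [r for v in vertices for r in map_vertex(v)]
edges = reduce_all(records)
```

After the reduce, each edge carries both Titan IDs, which is all the edge insert needs.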

Edge Insert Job

The edge insert job is fairly straightforward. The mapper reads in the edges one at a time and adds an edge between the source and target vertices. This is a map only job and there are no outputs other than the insertions into the graph.

Design Decisions

Custom Batch Framework

I initially tried to use Titan’s own batch framework. Two problems arose. First, there was limited room for insert optimizations. Second, it was not compatible with the newer version of Hadoop that we were using.

Writing my own framework also allowed me to devise a custom serialization schema. I chose Thrift because I had previous experience with it and because it was already being used for RPC. Alternatively, we could have used something like Protocol Buffers or Avro.

The use of an Item union over a plain byte string allowed property values to be stored compactly on disk while maintaining type safety throughout the graph build process.

Vertex IDs

One of the most (surprisingly) challenging issues I faced was identifying vertices. When a vertex is created in Titan, a vertex ID is generated by Titan through an opaque process. That is, the user does not know ahead of time what that ID could be. In the intermediate tables each vertex is given a unique ID referred to as the Rhino ID; however, it is not known ahead of time what the Titan ID will be.

This becomes an issue when, as part of a distributed insert, an edge is to be inserted between two vertices. While the edge contains the source and target (left and right hand vertices), looking up each vertex’s Titan ID is not straightforward.

I considered storing the Rhino ID as an indexed property on the vertex; however, that requires extra storage that would be wasted after insertion was complete. More importantly, looking up a Rhino ID in the distributed graph could require a network hop to the node that has the Titan ID in question. Instead I opted to use the information already available and perform the insertion in two stages as described.

Separate Vertex and Edge Insert Jobs

While the vertices could be inserted in the map phase and the edges in the reduce phase, I chose to separate them. Initially it was a single job; however, it was difficult to reason about the jobs and their performance.

Additionally, I needed finer granularity over the transaction size. Since the two inserts are separated, I can tune the edge insert job to do fewer inserts in a transaction because edge inserts require more memory as well as I/O bandwidth. This is reasonable given that inserting edges requires random lookups to obtain the source and target vertices. In the future, some optimizations can be made to partition the edge inserts so that they are more cache friendly.

Transactions

Each map task is its own transaction. This means that if a task fails, the segment of data it was working on will be replayed after the original failed transaction is rolled back. A Hadoop node can fail and the insert will recover and rerun the transaction on another node.

However, the initial size of a map task was quite large, on the order of hundreds of megabytes of compressed data. Since the Titan client keeps the transaction in memory until it is committed, this was causing Out of Memory errors.

I set the maximum task size to be about 10 MB of compressed data for inserts. This did increase the scheduling overhead and the overall insertion time. However, there were no longer any Out of Memory errors.

Conclusion

I’ve put the code up on GitHub so check it out!

Thanks to Tony Dong for editing this post.

SMTP Email Relay for GMail (TLS) with Oozie Using Postfix

2013-08-21T00:00:00-07:00

As part of Project Rhino, I’ve been setting up Hadoop along with Oozie to run our ETL pipeline.

Oozie has a cool feature that will send you an email as part of a job flow. However, the SMTP setup does not seem to support TLS (public-key encryption for SMTP), which GMail and Outlook.com / Live.com require.

What I did was set up a Postfix email relay on one of the servers. This allows Oozie to communicate unencrypted with the local SMTP server. Postfix then sends the mail on, encrypted, to the actual SMTP server.

The team uses outlook.com to host the email for our domain (it’s free!). However this setup should work for any email provider that requires TLS.

Postfix Setup

Install postfix:

apt-get install postfix

Then make a backup of your configuration (/etc/postfix/main.cf) and change it to:

/etc/postfix/main.cf

# The first hop server (change to smtp.gmail.com for GMail)
relayhost = [smtp.live.com]:587
smtp_sasl_auth_enable = yes

# Location of the password database.
smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd

# Require TLS for the connection to the first hop server.
# Postfix does not attempt STARTTLS by default.
smtp_tls_security_level = encrypt

# CAs to trust when verifying the server certificate
smtp_tls_CAfile = /etc/ssl/certs/ca-certificates.crt

# This trick is from
# http://mhawthorne.net/posts/postfix-configuring-gmail-as-relay.html
smtp_sasl_security_options =
Next we need to set up our authentication. We use oozie@tryrhino.com, but this could be changed to any valid account you have. Make sure this matches the from field you set in the Oozie config later.

/etc/postfix/sasl_passwd:

[smtp.live.com]:587  oozie@tryrhino.com:supersecret

Next we need to run this command to build the password DB:

postmap /etc/postfix/sasl_passwd

Then we can reload postfix:

/etc/init.d/postfix reload

You may also need to change the permissions of the password files.

sudo chown postfix /etc/postfix/sasl_passwd*

Configuring Oozie

When you are looking at the Oozie config, you’ll need to set the oozie.email.from.address to match the one you put in the Postfix configuration.
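Before pointing Oozie at the relay, it can help to check that Postfix accepts and forwards mail. Here is a minimal Python sketch that hands a message to the local relay without encryption, the same way Oozie would. The recipient address is a placeholder, and localhost port 25 is an assumption about where your Postfix is listening.

```python
import smtplib
from email.message import EmailMessage


def build_test_message(sender, recipient):
    """Build a small plain-text message to push through the relay."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Postfix relay test"
    msg.set_content("If you can read this, the relay works.")
    return msg


def send_via_local_relay(msg, host="localhost", port=25):
    """Hand the message to the local Postfix, unencrypted."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(msg)


msg = build_test_message("oozie@tryrhino.com", "you@example.com")
# send_via_local_relay(msg)  # uncomment on the server running Postfix
```

If the message never arrives, the Postfix mail log is the first place to look.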

Good luck!

Automate Your iPod/iPhone/iPad's Media

2012-12-19T00:00:00-08:00

I have a lot of music in iTunes. More than 20k songs. Beyond that I have a ton of audiobooks, movies and TV shows in iTunes for consuming when I’m traveling.

Now my iPhone and iPad have room for substantially less media. Plus I doubt that I’ll need over a month of different music before I can sync to iTunes again.

I used to manually manage what songs, TV shows and movies I had on my devices but that became very annoying and time consuming.

I felt there must be a better way. And there was: Smart Playlists.

iTunes has a cool feature called Smart Playlists which allows you to specify a set of filters to create a playlist automatically.

You can create a smart playlist by going to **File -> New -> Smart Playlist…**

Three Stars

Let’s start with a simple playlist for music. I rate all my music so I only want things with at least 3 stars to appear on my iPhone.

If that playlist is too big you can also opt to only include some number of songs or some size of songs based on a few criteria.

New Music

The 3+ Playlist works great for older music, but what about when I add new music that I have yet to rate?

I have another playlist that keeps the songs added in the last two months that I have yet to listen to. In case I add all of AC/DC I limit it to only 100 items so that my device isn’t overflowing with power ballads.

Syncing Selected Music

We need to tell iTunes to sync only the playlist we created. Go into the music tab for your device in iTunes and select Sync selected playlists, artists, albums and genres:

Then select the playlists you want to sync:

Audiobooks

I really enjoy listening to audiobooks but I hate having to remember to select which ones I’ve listened to and which ones I want to add to my iPhone.

Thankfully Smart Playlists work for audiobooks too!

I have a playlist which includes unlistened audiobooks from the past year. You could easily set up your playlist to only include a GB of books if you wanted to.

Similar to music sync, we need to tell iTunes to grab the audiobooks from our Smart Playlists.

TV Shows

I love The Wire and watched it on my iPad while traveling. Every time I finished the few episodes on my iPad, I felt it was silly that I had to go and manually tell iTunes I wanted the next few episodes.

With Smart Playlists, anytime I sync my iPad (which happens whenever it’s charging) watched episodes are deleted automatically and new ones are copied on. This ensures that I always have content to consume. This technique also removes last minute syncing I used to do before running to the train station.

Here’s the Smart Playlist for The Wire:

Again, we need to tell iTunes to pick up on the playlist by going to the Movies tab and selecting the playlists from Include Movies from playlists.

Movies

Now this is a little trickier. The problem is that iTunes treats my TV Shows as movies so I can’t simply make a playlist filtered by “Movie” type. My first thought was to only include items that were of a minimum size, but some standard definition movies are less than a GB and some HD TV episodes are more than a GB. I could go by duration, but something like BBC Sherlock looks pretty movie-like in length.

The solution I have is less automated. First I created a regular iTunes playlist that I manually put all of my movies on to. Then I created a Smart Playlist to select a few GB of unwatched movies. Take a look at it here:

Conclusion

Smart Playlists are a great way to automate the selection of your mobile media. Create Smart Playlists and never leave home without an unwatched episode.

Monitor Your Python App With FnordMetric and pyfnordmetric

2012-03-13T00:00:00-07:00

FnordMetric is a super cool (and sexy) real-time event monitoring app.

I’m currently using it for the next version of fidofetch. It’s great for tracking events and monitoring background workers.

It also has a great way to watch how a user uses your application:

You can see what events are caused by a user and all kinds of cool stuff.

I have a fork of FnordMetric here with a basic change to unclutter the users screen a bit.

Configuration

FnordMetric is written in Ruby on top of EventMachine. It’s super easy to get running.

First you describe a gauge, basically a counter:

gauge :logins_per_hour,
    :tick  => 1.hour.to_i,
    :title => 'Logins per Hour'

Then you need an event handler:

event(:api_login) { incr(:logins_per_hour) }

And finally some way to display it:

widget 'API', {
    :title            => 'Logins Per Hour',
    :type             => :timeline,
    :gauges           => :logins_per_hour,
    :include_current  => true,
    :autoupdate       => 60 # refresh graph every minute
  }

You can have multiple time series per chart, auto-refreshing of data, top-lists, bar charts and more.

There are so many ways to display your data; a full-blown example is over here.

Getting Data In

There are a few ways to get data into FnordMetric. You can use the HTTP API, send data over a TCP connection or a Redis queue.

FnordMetric is actually backed by Redis so the fastest way to get data in is to talk directly to the backend.

For this I wrote a Python module cleverly named pyfnordmetric.

Just easy_install pyfnordmetric and start using the Fnordmetric module like so:

from fnordmetric import Fnordmetric

# Connect to the Redis server that backs FnordMetric
fnord = Fnordmetric("localhost", 6379)
# Record a single event
fnord.event("saw_unicorn")

Tracking Users

That’s pretty useful in its own right, but one of the cool features of FnordMetric is that it allows you to see what a specific visitor is doing on your site right now. For this there are a couple more features:

fnord.event("login", "session1234")
fnord.set_name("Stephen Holiday", "session1234")
fnord.set_gravatar("stephen.holiday@gmail.com", "session1234")

That code will automatically grab a user’s Gravatar if they have one and display it on the FnordMetric window:

Conclusion

FnordMetric has been a great help in debugging issues with Fidofetch and I want to thank the developers for their hard work. I’d also like to thank Kirk Scheibelhut for telling me about the project.

There’s so much more you can do with FnordMetric. There is support for different charts, three dimensional data and other goodies. It’s definitely worth browsing around the repository for cool features.

I’m working on a few other tools to get more data into FnordMetric, with posts in the works.

Let me know what you think.

Stack Overflow Word Trends by Day

2011-12-02T00:00:00-08:00

Here’s a quick hack to see how different words are used over time on StackOverflow.

Inspired by Google’s Ngram Viewer I hacked together this page to show how different words on Stack Overflow vary over time.

The y-axis is the percentage of times the word was used on that day. You can compare multiple words by using a comma.


The Backend

The backend is a Java app that uses Google’s LevelDB to store all the words and the frequencies per day. It’s not optimized very much at all and the data is definitely too verbose but it’s a quick hack.

I used the data from Stack Overflow Data Dump. I wrote some quick Python scripts to parse the data and get all of the words used in posts (questions and answers).
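The counting itself is simple. Below is a Python sketch of the idea, not the actual scripts: it tallies words per day, turns each count into a percentage, and smooths a series with a trailing moving average. The tokenizing regex, the choice of all words posted that day as the denominator, and the moving-average smoothing are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict


def daily_word_percentages(posts):
    """Map day -> {word: percentage of that day's words}."""
    counts = defaultdict(Counter)
    for day, text in posts:
        counts[day].update(re.findall(r"[a-z0-9+#]+", text.lower()))
    return {
        day: {w: 100.0 * n / sum(c.values()) for w, n in c.items()}
        for day, c in counts.items()
    }


def smooth(values, window):
    """Trailing moving average over a list of daily values."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


posts = [
    ("2011-12-01", "Java or Python"),
    ("2011-12-01", "python python"),
    ("2011-12-02", "java java java python"),
]
pct = daily_word_percentages(posts)
```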

I have data for 2-, 3-, and 4-grams, but I haven’t loaded it into the server yet because I want to clean up the server and the client first. While this site is on Amazon’s S3, I’m a student and my server is a single-core Celeron with 2 GB of RAM.

Thoughts?

If you have any thoughts or suggestions, comment in the thread on HackerNews or send me an email (stephen.holiday@gmail.com).


Gender Prediction with Python

2011-06-23T00:00:00-07:00

Sometimes contact information is incomplete but can be inferred from existing data. Gender is often missing from data but easy to determine based on first name.

Some Solutions

One solution is to check names against existing data. A query can be run against known-valid name/gender pairs and the gender with the most occurrences of that name wins.
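Here is a minimal Python sketch of that lookup, using an in-memory table instead of a database query. The function names and the sample pairs are made up for illustration.

```python
from collections import Counter, defaultdict


def build_lookup(pairs):
    """Count how often each name appears with each gender."""
    table = defaultdict(Counter)
    for name, gender in pairs:
        table[name.upper()][gender] += 1
    return table


def lookup_gender(table, name):
    """Return the most common gender for a name, or None if unseen."""
    counts = table.get(name.upper())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


known = [("Alex", "M"), ("Alex", "M"), ("Alex", "F"), ("Maria", "F")]
table = build_lookup(known)
```

Note that the lookup has nothing to say about a name it has never seen.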

But what about new names and alternate spellings?

What’s in a name?

It turns out that there are features that are indicative of one gender or another. For example, it is more likely that a name ending in ‘a’ is female rather than male. There are also other patterns such as the last two letters of a name.

We could write a series of heuristics to make a determination but that does not seem like a scalable idea. I’d like to be able to apply this approach to other languages and not have to learn the ins and outs of each.

Enter ‘Machine Learning’

What we need to do is figure out which features indicate which gender and how strongly they do so.

I think ML tends to scare a lot of people. When I’m recommending a ML solution to someone, I tend to call it a statistical approach to the problem. So I’m going to call this solution a statistical approach.

What we are doing is classifying the data into one of two categories, male or female. For this I chose one of my favourite classifiers, Naive Bayes. I’m a fan of Naive Bayes because its basis is simple to understand and it performs decently well (in my experience).

I’m a big fan of the NLTK’s (Natural Language Toolkit) easy interface to classifiers such as Naive Bayes and it’s what I used for this project.

Overview

First, we’re going to need some data to train the classifier on to see which features indicate which gender and how much we can trust the feature. I grabbed training data from the US Census website and wrote an importer module for it in Python.

Second, we need a feature extractor to take a name and spit out features we think may indicate the gender well. I wrote a simple extractor that takes the last and last two letters and spits them out as a feature as well as if the last letter is a vowel:

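def _nameFeatures(self, name):
    # Upper-case the name so features do not depend on input casing
    name = name.upper()
    return {
        'last_letter': name[-1],
        'last_two': name[-2:],
        # Compare against upper-case vowels, since name was upper-cased
        'last_is_vowel': (name[-1] in 'AEIOUY')
    }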

Third, we need to test the classifier. We need to be sure that we separate the training data set from the test data set. If we just wanted to do a lookup, a hash table would be much more efficient. We’re interested in the classifier’s ability to determine the gender based on names it has not encountered before. So we randomly shuffle the data and split. I chose to split 80% for training and 20% for testing but that’s something you can play with.
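The shuffle and split takes only a few lines. This is a sketch of the idea rather than the code in the repository; the function name, the seed parameter, and the generated sample names are assumptions for the example.

```python
import random


def split_data(pairs, train_fraction=0.8, seed=None):
    """Shuffle (name, gender) pairs and split them into train and test."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


pairs = [("NAME%d" % i, "F" if i % 2 else "M") for i in range(10)]
train, test = split_data(pairs, seed=42)
```

Passing a seed makes the split repeatable, which is handy when comparing feature ideas.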

Fourth, we need to learn which features matter. The NLTK provides a nice method which will tell us which features were most useful in determining the gender. This way we can concentrate on features that really matter.

Get The Code

I’ve done a lot of the wrapper work for you and put it up on GitHub. Check out the gender prediction code here. If you run genderPredictor.py it will automatically train and test the genderPredictor module. You can also import genderPredictor into your own code and run the methods manually.

The most useful method to use within your own code is classify(name) which takes a name and spits out the gender.

You can modify _nameFeatures to play around and test other feature ideas. If you find something that works better, please let me know and I’ll incorporate your idea and give you credit.

Hope this is useful and interesting; let me know what you think.

FidoFetch Architecture

2011-04-25T00:00:00-07:00

What’s FidoFetch

FidoFetch is an RSS/ATOM news reader service much like Google Reader. There are too many news articles and blog posts every second to read them all. In fact, most of it is not something you care about. FidoFetch watches which kinds of articles you like and only shows you things it thinks you will be interested in.

Of course, Fido isn’t really a super smart canine, he’s software I wrote. In fact, Fido isn’t really just one program, he’s a collection of many different programs doing different tasks.

In this article I’m going to take you through the backend architecture to FidoFetch. It’s been several iterations to arrive at this architecture, and I’m sure there will be many more.

Why?

Some people may not like to share information about the inner workings of their projects. I completely understand where they are coming from. However, I’ve been primarily self-taught in the area of scalable architectures. What that really means is I’ve been reading about how other websites build and iterate on their architectures. Their openness has allowed me to learn so much from their successes and failures. I really do believe this has helped me succeed on my co-op work terms.

My hope is this article will help others learn from my mistakes and make their own really cool projects a reality. If you do find this article interesting or helpful, please let me know.

Goals

When building a system, I think it’s useful to make the goals clear. Ideally they should have quantitative measurements associated with them (as PDENG would profess) so you know when you’ve met them. But, I didn’t know what numbers I wanted to reach, I just wanted to get a working prototype out.

Here were my main goals with the architecture of the system:

1) Reading articles is most important

Being able to read and rate articles should be quick and responsive. The user interface should not lag despite a large load on the server. Having the most up to date articles is not as important.

2) Elastic Server Capacity

I’m a student and while my co-op jobs have been quite good in terms of compensation, I don’t want to waste money on servers I don’t need. Right now this is just a personal project, not a business. So I only want to run the bare minimum of servers to run FidoFetch.

One easy way to do this is to figure out how much capacity I need and run the correct number of servers. But what happens if HackerNews, Digg or reddit finds out about FidoFetch? It’s a fairly computationally intensive application and having many more users all of a sudden could be traumatic to the server.

Another solution is to have the system scale based on its current usage. If the system detects a high load, it boots up another instance in a cloud. Getting more capacity through EC2 or Rackspace isn’t hard at all to automate; the challenge is getting the software to make use of the additional capacity.

3) Ability to try different recommendation algorithms

FidoFetch is part of a learning project for me. I don’t know what the best algorithm is for recommending articles in a news feed. But that’s part of the fun, it’s an experiment.

So, I need to be able to try many different algorithms side by side. This dictates that the architecture not be tied to any one type of algorithm. This is also a problem I dealt with at Tagged during my internship.

4) Let others design the reader

I’m not a great visual designer. If you look at some of my projects it will become quite apparent. Some of them look the same in Links as they do in a graphical web browser…

So clearly I shouldn’t be the designer for the reader. Many people can make much more beautiful interfaces that would benefit from my platform. To that end, FidoFetch must be able to be used under someone else’s reader app (be it JavaScript, HTML5, BlackBerry, iOS, offline/desktop, Android, etc…).

Solution

Queues

Goal #1 would require the ability to prioritize different parts of the system and hold unimportant computation for later. This speaks to me as a great place to use a queue. A queue would allow feed fetching and recommendation to happen behind the scenes when the server isn’t busy.

Goal #2 asks for the ability to scale server capacity and make use of the added capacity. A queue could allow for many different machines to do distinct units of computation and then store the results. There could be a set of queues for different job types (fetching a feed, recommending an article etc.) and each message in the queue could be a specific job. A set of workers could consume the message/job, do the task and then report the results.

For FidoFetch, I have at least one worker running for each type of job (and thus queue). When more server capacity is created, I just have my system start workers on the new machines that connect to the queuing server. The need for more servers can be detected by analyzing the queue. Two common metrics are the number of items in the queue and the time it takes from entering the queue until the job is processed. Testing the number of items in the queue is really straightforward to implement and is what I use.
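A queue-length check can be as small as the Python sketch below. The thresholds are invented for illustration (100 waiting jobs per worker, between 1 and 10 workers) and would need tuning against real traffic.

```python
import math


def workers_needed(queue_length, jobs_per_worker=100, min_workers=1,
                   max_workers=10):
    """Pick a worker count from the number of jobs waiting in a queue."""
    wanted = math.ceil(queue_length / jobs_per_worker)
    return max(min_workers, min(max_workers, wanted))
```

The minimum keeps at least one worker alive per queue, and the maximum caps the server bill if the queue suddenly explodes.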

Jobs

I designed the jobs in a specific way in order to achieve the scalability I wanted. The key things are:

All messages representing jobs are JSON dictionaries containing the parameters for the job.
A job may be processed more than once, by the same or different workers, at different times or simultaneously.
A job represents a small section of work that will take no more than 4 minutes.
If a job exceeds its run time, it will be placed back into the queue after a back-off time and processed by another worker.
If a job needs to be rerun more than 10 times, it is removed from the queue and placed in a collection of buried items.
Each job can put multiple jobs into other queues for further processing.

JSON Messages: I chose to standardize on JSON for the message format because I use it everywhere else in my application and it has libraries in so many languages.

Duplicate Processing: To make the queue more efficient and not waste time, I require only that messages are delivered. I don’t need a guarantee that jobs will only be processed once; such a guarantee would greatly complicate the architecture. It’s easier to just write the software to handle duplicate runs by default. There’s not really any harm if a feed is fetched twice. In the case of recommending an article to a user, I have it set up in the database so that there are no duplicates on the article_key.
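A job that is safe to run twice usually writes to a keyed slot rather than appending. The Python sketch below shows the idea with a nested dictionary standing in for the database and its uniqueness rule on article_key; the function name and the score value are assumptions for the example.

```python
def recommend(recommendations, user_key, article_key, score):
    """Record a recommendation; a repeated run overwrites, never duplicates."""
    recommendations.setdefault(user_key, {})[article_key] = score


recs = {}
recommend(recs, "user1", "article42", 0.9)
recommend(recs, "user1", "article42", 0.9)  # duplicate delivery of the job
```

After both calls the user still has exactly one entry for the article.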

Time Limits: The time limit makes it easy for the queuing server to detect when there is an issue with a worker processing a job. If a job exceeds this time, the queue knows it’s ok to just put it in the queue again.

Back-off Time: This one came from experience. It turned out that if a job failed for some reason or another and was placed back into the queue and run again immediately, it was likely to fail again. Often there was some resource that wasn’t working or being slow (a database or a third party feed). When the job was placed back into the queue, it would just fail again.

Had the job been held back for a few seconds there would not have been an issue. I had the queue wait a bit before reinserting the job into the queue. I had the wait time double each time the job failed. Ethernet does something similar when there is a packet collision. Ethernet, however, uses a random back-off time; otherwise the two machines that detected the collision would just keep resending at the same intervals.
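The doubling wait is easy to express in code. In this Python sketch the 2 second base delay is an assumption, and the optional jitter parameter is an addition that borrows the randomness idea from Ethernet; it is switched off by default.

```python
import random


def backoff_delay(failures, base=2.0, jitter=0.0, rng=random):
    """Seconds to wait before retrying a job that has failed `failures` times."""
    delay = base * (2 ** (failures - 1))  # doubles on every failure
    # Optional randomness so that retries do not all line up.
    return delay + rng.uniform(0, jitter * delay)


delays = [backoff_delay(n) for n in range(1, 5)]
```

With jitter left at zero the delays for the first four failures are 2, 4, 8 and 16 seconds.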

Retry Limit: This one was also learned the hard way. For a while I had the database misconfigured and it would get very slow after a while (24 hours+). Eventually it would stop serving requests and the worker would detect the failure immediately and tell the queue it failed. The worker would get another job right away and try again. The job would fail and then be in back-off mode.

Once the timeout was complete all the jobs would come back again and hammer the database server even further. This made it very difficult to go into the server and restart the database. So I implemented a limit of retries. When the limit is reached, the jobs go into a special buried queue. Once I fixed the problem (in this case restarting the database), I could just kick the jobs back into the regular queue.
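Here is a toy in-memory version of the retry limit in Python. The class and method names are made up for the example; a real queuing server keeps this state for you. The limit of 10 matches the job rules above.

```python
from collections import deque

MAX_RETRIES = 10  # a job may be rerun at most this many times


class RetryQueue:
    def __init__(self):
        self.ready = deque()
        self.buried = []

    def put(self, job):
        self.ready.append({"body": job, "failures": 0})

    def fail(self, item):
        """Requeue a failed job, or bury it once it has failed too often."""
        item["failures"] += 1
        if item["failures"] > MAX_RETRIES:
            self.buried.append(item)
        else:
            self.ready.append(item)

    def kick(self):
        """Move every buried job back to the ready queue."""
        for item in self.buried:
            item["failures"] = 0
            self.ready.append(item)
        self.buried = []


queue = RetryQueue()
queue.put({"feed_key": "abc"})
while queue.ready:            # simulate a database that is down
    queue.fail(queue.ready.popleft())
```

Once the job is buried the loop stops, so a broken resource is no longer hammered, and kick() restores the job after the fix.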

Chain of Jobs: Now this part has been really helpful in keeping FidoFetch running. I’m going to need a separate section to explain this. For now it suffices to say that each worker has the ability to insert subsequent tasks in other queues for processing.

Chain of Jobs

Chaining of jobs is a very important part of FidoFetch’s architecture. It’s a great model for writing distributed processing software. It’s neither a new concept nor a complex one, but it is very cool (in my opinion).

In FidoFetch, the process to fetch a feed and then send it to a user’s to-read list is broken into several parts. Six to be exact. Each part deals with a specific task that is separable from the others. It does not depend on what the other workers are currently doing. Essentially, the output of each job is the input of the next.

Do_Job3( Do_Job2( Do_Job1(user_key) ) )

Each job places its result on the queue for the next job type as a new job. This can be thought of as a chain:

user_key -> Worker1 -> Worker2 -> Worker3

Now, some jobs will actually have multiple return values that will create multiple jobs on the next queue. For example, looking up all the users who need to be notified about a new article will return multiple users. The actual notification of a user requires only the user_key of that user and is independent of the other users that need to be notified. Thus multiple jobs are placed into the queue, one for each user_key.
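The fan-out can be sketched with two in-memory queues in Python. The queue names, the worker functions, and the hard-coded subscriber list are all assumptions for illustration; in FidoFetch the queues live on a queuing server and the workers run as separate processes.

```python
from collections import deque

queues = {"find_users": deque(), "notify_user": deque()}
notified = []


def find_users(job):
    """One job in, one follow-up job out per subscribed user."""
    subscribers = {"article1": ["user_a", "user_b"]}  # stand-in for a database
    for user_key in subscribers.get(job["article_key"], []):
        queues["notify_user"].append({"user_key": user_key})


def notify_user(job):
    """Last link in the chain: needs only the user_key."""
    notified.append(job["user_key"])


workers = {"find_users": find_users, "notify_user": notify_user}

queues["find_users"].append({"article_key": "article1"})
for name in ("find_users", "notify_user"):   # drain the chain in order
    while queues[name]:
        workers[name](queues[name].popleft())
```

One job on the first queue becomes two independent jobs on the second, and each could be picked up by a different machine.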

By separating the overall task in multiple parts, we gain a few things:

Failure of one part does not require losing the work so far
Ability to distribute a seemingly single task into multiple pieces
Ability to write different workers in different languages (maybe C++ for something that needs to be high performing)
Ability to add more workers for a specific task that either happens more often or takes longer.

It’s extremely useful. Right now everything is written in Python, but I plan to write some workers in C++ soon. The ability to add workers for a specific task has also been especially useful. The feed fetching operation requires very little CPU; most of its time is spent waiting for the remote server to respond. If I had just one worker doing feed fetching, it would stall the whole process if one server was particularly slow. I run several feed fetching workers at the same time so faster feeds get processed quickly.

FidoFetch’s Chain

Now I’m going to delve into how I decided to split up the tasks. These are not all the job types that are in FidoFetch. There are some other tasks that get run in the background, but they are not in the main chain. I’ll cover those tasks later.

Entrance Points

There are two ways to kick off the process. The first way is on login. When a user logs into FidoFetch, the system adds a job to update their feeds with the updateUserFeeds task type. There is probably already stuff for the user to read, but let’s ensure they have lots of interesting stuff to read by rechecking all of their feeds.

The second way is through a cron job. Wait a minute… weren’t we designing a distributed system? Cron seems very non-distributed doesn’t it…

Well, yes. This is true. But I have reasons: 1) I want it to start automatically and 2) queuing the same job does not violate the principles we set in the Jobs section above. So if I have all the servers just queuing jobs, the system should be prepared for this. The system is, which we’ll talk about later in the updateFeed worker section.

So back to the entrance point. Every 20 minutes (totally customizable), the system queues a job for each feed that has subscribers (a list I maintain; lookups are in constant time). This happens with a Python script. The twenty minutes is just a reasonable number I picked.

updateUserFeeds(user_key)

The updateUserFeeds task’s name is somewhat of a misnomer. Technically it doesn’t update the feeds of that user. We didn’t design this whole chaining system for nothing!

What actually happens in this job is quite simple. The worker is passed (as a part of the message) the user_key of a user. The key is an arbitrary string (I choose to use a hash of the username, but the worker doesn’t care). The worker then looks up in the database all of the feeds that the user is currently subscribed to (in constant time).

Then the worker takes a look at all the feeds (each is a constant time lookup in the database) and reduces the list to those that haven’t been updated in over 15 minutes. The 15 minutes is again arbitrary, but is a reasonable number for current operation. Then, for each feed remaining, the worker places a job in the queue for updateFeed().
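A Python sketch of this worker follows. The dictionaries stand in for the database lookups, a plain list stands in for the queue, and the job layout is an assumption made for the example.

```python
import time

STALE_AFTER = 15 * 60  # seconds; the arbitrary 15 minutes from the text


def update_user_feeds(user_key, subscriptions, last_updated, queue, now=None):
    """Queue an updateFeed job for each of the user's stale feeds."""
    now = time.time() if now is None else now
    for feed_key in subscriptions.get(user_key, []):
        if now - last_updated.get(feed_key, 0) > STALE_AFTER:
            queue.append({"task": "updateFeed", "feed_key": feed_key})


subscriptions = {"user1": ["feed_a", "feed_b"]}
last_updated = {"feed_a": 1000, "feed_b": 1900}
queue = []
update_user_feeds("user1", subscriptions, last_updated, queue, now=2000)
```

Only feed_a is queued here, because feed_b was updated 100 seconds before the check.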

updateFeed(feed_key)

This worker does what its name implies. It updates the feed given to it. First it looks up in the database to see if the feed has been checked in the past 15 minutes. If it has, then the worker is done with the task.

The reason the worker looks up the feed’s last update time in the database again is to be consistent with our rules for jobs. Earlier we said that a task is not guaranteed to only be run once. This means that two different workers (maybe on different machines) may receive the job. To ensure that we don’t hit the remote server too often (and waste our time), we only fetch the feed if its data is at least 15 minutes old.

If the feed hasn’t been updated recently, the worker fetches and parses the feed. In past implementations of FidoFetch I wrote my own RSS/ATOM processor. I think it’s something many naive developers have done. Don’t bother. While RSS and ATOM are standards, their implementations are far from standardized across the web. There are so many different styles of generating feeds and so many malformed feeds. It’s not worth trying to figure out all the nuances of feed parsing.
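The guard can be sketched roughly as below. This is an illustration only: `last_checked`, `fetch_log` and `fetch_and_parse` are hypothetical names, a dict stands in for the database, and the sketch ignores the case where two workers perform the check at the very same instant.

```python
CHECK_INTERVAL_SECONDS = 15 * 60

# Hypothetical record of when each feed was last fetched (epoch seconds).
last_checked = {}
fetch_log = []


def fetch_and_parse(feed_key):
    """Placeholder for the real download and parse step."""
    fetch_log.append(feed_key)


def update_feed(feed_key, now):
    """Fetch the feed unless it was already checked recently.

    Returns True if a fetch happened, False if the job was a no-op.
    """
    previous = last_checked.get(feed_key)
    if previous is not None and now - previous < CHECK_INTERVAL_SECONDS:
        return False  # a duplicate or early job: nothing to do
    fetch_and_parse(feed_key)
    last_checked[feed_key] = now
    return True


# The same job delivered twice only hits the remote server once.
first = update_feed("feed-1", now=10_000)
second = update_feed("feed-1", now=10_005)
later = update_feed("feed-1", now=10_000 + CHECK_INTERVAL_SECONDS)
print(first, second, later)
```

Because the check lives inside the worker, it does not matter who queued the job or how many times it was queued.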

This time I was smart, or rather a co-worker (a really awesome intern from a university in New Mexico) was for me. He pointed me in the direction of the Universal Feed Parser. It’s a feed parser in python that does an awesome job. It’s really quite an awesome piece of software. The interface is also quite intuitive and pythonic.

So the worker grabs the feed and checks if each article exists in the database already. The key for the articles is a hash of some of the meta information of the article. Often the guid or url of the article is used. This makes lookups very fast.
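As an illustration of the key idea, here is a small sketch. The choice of SHA-1, the fallback order (guid, then url, then title) and the `seen` dict are all assumptions made for the example, not a description of what FidoFetch actually does.

```python
import hashlib


def article_key(guid=None, url=None, title=""):
    """Derive a stable key from whichever identifying metadata is present."""
    identity = guid or url or title
    return hashlib.sha1(identity.encode("utf-8")).hexdigest()


seen = {}  # hypothetical stand-in for the articles stored in the database

key = article_key(url="http://example.com/post/1")
is_new = key not in seen  # first time we see this article
seen[key] = {"url": "http://example.com/post/1"}

# Parsing the same feed again yields the same key, so the lookup finds it.
is_new_again = article_key(url="http://example.com/post/1") not in seen
print(is_new, is_new_again)
```

Since the key is derived from the article itself, checking whether an article already exists is a single key lookup.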

Now, for each article that was new, the worker saves the article data and queues a job in the prepareArticle(feed_key, article_key) queue.

prepareArticle(feed_key, article_key)

This worker is the first to do something a little more interesting than the previous ones. This worker is given an article for a given feed to work on.

It grabs the article from the database and then processes it. Internally it actually runs through a list of processors (which are just python modules I wrote). This is part of one of the requirements we discussed earlier, the ability to try multiple learning algorithms at the same time.

Different algorithms require different pre-processing of articles. A common one is tokenizing and generating n-grams. This processing is independent of the user the article is being recommended to and is static. That means that pre-processing an article is a perfect sub task to be done ahead of time.
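A rough sketch of such a processor list is shown below. The processors here are plain functions in a dict rather than separate modules, and the names `PROCESSORS`, `tokenize`, `bigrams` and `prepare_article` are invented for the example.

```python
def tokenize(article):
    """Lowercase the text and split it on whitespace."""
    return article["text"].lower().split()


def bigrams(article):
    """Pairs of consecutive tokens."""
    tokens = tokenize(article)
    return list(zip(tokens, tokens[1:]))


# Each algorithm registers the pre-processing it needs under a name.
PROCESSORS = {"tokens": tokenize, "bigrams": bigrams}


def prepare_article(article):
    """Run every processor once and store the results with the article."""
    article["prepared"] = {name: fn(article) for name, fn in PROCESSORS.items()}
    return article


article = prepare_article({"text": "Riak is a distributed database"})
print(article["prepared"]["tokens"])
```

None of this depends on who will read the article, which is why it can be computed once and reused for every subscriber.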

Once the preparation is complete, it just enqueues another job into the notifySubscribers(feed_key, article_key) queue.

notifySubscribers(feed_key, article_key)

This worker is pretty simple. It looks up all the users who are subscribed to a given feed. I have such a list stored under the feed_key in the database so this is a constant time operation.

Then, for each person in the list, the article is added to the recommendArticle(article_key, user_key) queue.

recommendArticle(article_key, user_key)

This worker is the main point in the application. This is what makes FidoFetch smart. This is where Fido decides if an article is something you’d be interested in.

This worker runs through each recommendation algorithm (each is a python module) and calculates the recommendation. It stores the recommendation in an articlesToRead list keyed by the user_key.

If this worker is run more than once, the worker catches this by seeing it’s already rated this article for this user.

This worker does not report to anything else. The chain is done!
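To make the chaining concrete, here is a toy version of the whole chain running against an in-memory queue. Everything in it is a simplification: a `deque` replaces the real queuing service, dicts replace the database, the handlers do no real work, and all names are assumptions for the example.

```python
from collections import deque

# Hypothetical in-memory data; the real system keeps this in its database.
new_articles = {"feed-1": ["article-a", "article-b"]}
subscribers = {"feed-1": ["user-x", "user-y"]}
articles_to_read = {}

jobs = deque()


def enqueue(task, *args):
    jobs.append((task, args))


def update_feed(feed_key):
    for article_key in new_articles.get(feed_key, []):
        enqueue("prepareArticle", feed_key, article_key)


def prepare_article(feed_key, article_key):
    # Pre-processing would happen here.
    enqueue("notifySubscribers", feed_key, article_key)


def notify_subscribers(feed_key, article_key):
    for user_key in subscribers.get(feed_key, []):
        enqueue("recommendArticle", article_key, user_key)


def recommend_article(article_key, user_key):
    # Using a set makes a repeated job harmless.
    articles_to_read.setdefault(user_key, set()).add(article_key)


HANDLERS = {
    "updateFeed": update_feed,
    "prepareArticle": prepare_article,
    "notifySubscribers": notify_subscribers,
    "recommendArticle": recommend_article,
}

# One job goes in; each handler only knows about the next link in the chain.
enqueue("updateFeed", "feed-1")
while jobs:
    task, args = jobs.popleft()
    HANDLERS[task](*args)

print(articles_to_read)
```

A single updateFeed job fans out into one recommendArticle job per new article per subscriber, and any number of workers could be draining the queue in place of the `while` loop.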

Other Tasks

Not everything fits into the chain. There are a few tasks that need to be completed but don’t depend on other tasks. In other systems this may include cleaning up temporary files, updating server code etc.

regenerateModels(user_key)

For many algorithms, there can be a huge performance boost in pre-generating some sort of model every once in a while. Not all algorithms can do this but a lot can. So each algorithm has the option to add itself to the list of things that should be run periodically.

These tasks tend to be really heavy. Regenerating just one of the models for my interests takes about 3 minutes if there is no other load. Putting this task in a distributed worker queue makes a lot of sense.

markArticleRead(article_key, user_key)

Once you read an article, a lot more changes than just your articlesToRead list. There are other indexes that can quickly show what everyone else thought of a given article, what people think of certain feeds in general and so on.

Previously, all these indexes were updated as soon as you click read. But that left the interface feeling slow. Since reading articles is the primary use for FidoFetch, a snappy interface is important. Running these less important index updates asynchronously works great.

Queuing System

So we’ve discussed at length how FidoFetch uses queues, but what does FidoFetch use for its queuing service?

I evaluated several options for FidoFetch: Amazon’s Simple Queue Service, SpringSource’s RabbitMQ, ZeroMQ, Apache’s ActiveMQ and even using a database as the queue.

Once again one of my coworkers suggested a product he has used in the past, beanstalkd.

Beanstalkd

Beanstalkd is a simple and straightforward worker queue. Now technically it is not a distributed worker queue. However, workers can connect to it from anywhere and start processing jobs.

My design requirements do not require the absence of any single point of failure. If this were to be the case, I would have to use one of the other queuing services. FidoFetch was designed to be scalable for my needs.

There are clients for Beanstalkd in many languages and it has the basic features I need. I use beanstalkc for python.

I also wrote a simple monitoring script for ganglia.

The Database

FidoFetch needs somewhere to store all the articles, subscriptions, rating data etc. The design of the system requires multiple servers be able to access the information in a scalable fashion.

Since the server farm needs to be elastic, the database needs to be able to grow and shrink in terms of nodes.

Riak

Riak is the database that powers FidoFetch. There were a few reasons I chose this database:

Free (I’m a student…)
Distributed
Fast lookups on key
Elastic (Nodes can leave and join the network without manual rebalancing)
Fault tolerant (Nodes can die and come back, downloading only the changed data)
Map/Reduce tools
Intuitive API

Riak is really quite awesome. It’s written in Erlang and is really straightforward to use. I do admit I had some trouble with the Map/Reduce functionality but I’ve managed to find workarounds.

I recommend you check out the riak wiki for more information.

Conclusion

I’ve covered what I wanted to discuss about FidoFetch’s architecture. I hope you found it interesting despite the length. I hope that you will also find it useful or that it at least gets your brain churning.

So far FidoFetch’s architecture has been performing great. I expect there will be further iterations.

Thanks for reading, please shoot me an email with any thoughts.

Introduction To MapReduce

2011-03-12T00:00:00-08:00

Sometimes datasets are a little larger than what we can easily process on a laptop. In these cases it’s often helpful to harness the power of many machines to do processing of data.

MapReduce

MapReduce is a programming paradigm invented by Google to make it easier to write distributed software using common programming constructs.

In python it’s not unusual to ‘map’ a function across a list:

>>> some_list = ['Hi my', 'name', 'is', 'Stephen Holiday']
>>> list(map(lambda x: x.swapcase(), some_list))
['hI MY', 'NAME', 'IS', 'sTEPHEN hOLIDAY']

Similarly you can reduce a list by iterating over it and returning a single value each time:

>>> from functools import reduce
>>> reduce(lambda x, y: x+y, [1, 2, 3, 4, 5])
15

It turns out that these two concepts make it very easy for distributed systems to break up tasks. Since each map requires only the current item in the list, the above map operation could have occurred on 4 different machines.

Contrived Example

Let’s say we have a list of a million numbers that we wanted to sum together. Unfortunately they aren’t stored as integers but as strings. Let’s also say that converting these strings to numbers is a really intensive task.

Thankfully we have 4 machines that can do our work for us. In order to harness the power of the machines, we need to somehow split up the tasks.

Converting a string to an integer does not require knowledge of the other strings that we wish to convert. It’s an isolated operation to convert the string. This means we can split up the list.

We can also note that summing the integers does not actually require knowing about all of the integers at once. We can apply the summing operation multiple times in order to get the right result. Addition is commutative and associative. This means that we can apply the addition in any order we want.

 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10
=(1+2) + (3+(4+5)) + 6 + ((7+8) + (9+10))
etc...

We could use a worker queue or something similar, but that really requires a lot of customization. Using a MapReduce infrastructure, we can easily deploy new jobs to the cluster.

Map

So our map function could be:

def map_convert_to_int(input_string):
    return int(input_string)

Now, let’s say we had some distributed map reduce framework (cough hadoop), it could break up the list and send a portion of the list to each machine.

For simplicity, we’re only going to use the numbers 1-10, but this can easily be expanded.

list=['1','2','3','4','5','6','7','8','9','10']
machine_1_list=['2','1','7']
machine_2_list=['10','6']
machine_3_list=['3','5']
machine_4_list=['9','4','8']

So as you can see, each machine gets some subset of the list. The machine doesn’t care which ones it gets, the order, or the size since each map operation is independent.

Once each machine gets its list (or as soon as it receives its first item), it can start running the map function. Each machine now has a list of integers.

Reduce

Now we need to sum up the numbers.

Our reducer could look like this:

def reduce_sum(accumulator,new_value):
    return accumulator+new_value

This would be run against the list, with an accumulator value. Since addition is commutative and associative, it doesn’t matter in what subset or order we run the reducer.

So, each machine could run the reducer on its own list currently stored in memory and return the single sum. Then the master machine that started the job could just run the reducer on the list of results it got back.

Tada! Simple distributed programming.
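Putting the two halves together, the whole job can be simulated on one machine. This is only a sketch: the four “machines” are just four lists processed one after another, and it uses Python 3’s `functools.reduce` rather than the builtin `reduce` from Python 2.

```python
from functools import reduce


def map_convert_to_int(input_string):
    return int(input_string)


def reduce_sum(accumulator, new_value):
    return accumulator + new_value


# Each inner list plays the role of one machine's share of the input.
machine_lists = [
    ["2", "1", "7"],
    ["10", "6"],
    ["3", "5"],
    ["9", "4", "8"],
]

# Each "machine" maps its strings to integers, then reduces them to one sum.
partial_sums = [
    reduce(reduce_sum, map(map_convert_to_int, chunk), 0)
    for chunk in machine_lists
]

# The master reduces the partial sums with the very same reducer.
total = reduce(reduce_sum, partial_sums, 0)
print(partial_sums, total)
```

The partial sums come out as 10, 16, 8 and 21, and the total is 55, the same answer as summing 1 through 10 directly.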

Conclusion

This article covers the basics of MapReduce. Different implementations have different additional features, but the basics are still there.

This article is just an introduction and later I will write more articles on practical uses of MapReduce.

I hope this was interesting to you, let me know what you think.

Mining Jobmine - Fall 2010 - Part 3

2011-03-01T00:00:00-08:00

This is the third part in a multi-part series, read the first part here and the second part here.

JobMine is the tool we use at the University of Waterloo for our co-op. For each job in JobMine there is a description of the job written by the employer. In this article I’ll dig into the job descriptions applying some of the techniques I talked about in the earlier articles.

##Unigrams##

Just like I did in the first article, I tokenized descriptions and collected the most common ones.

Ugh, that’s not very useful at all! The most common tokens seem to be those that are most common in English in general. Thankfully, we have lists of such common words. They’re called stop words.

##Stop Words##

Stop words are common words that don’t really give any insight into the text, like ‘the’, ‘and’, ‘or’ or ‘but’. Often people trying to classify text or extract meaning remove stop words from the text. The idea is that removing the stop words reduces noise in the data and compacts feature space.

I looked into the effectiveness of stop words in my Twitter Sentiment Extraction project.

Now I need a list of stop words to remove. Since I’m a fan of the NLTK (Natural Language Toolkit) for python, I’m going to use the built in stop word list.
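Filtering inside a tokenizer could look something like this sketch. To keep it self-contained it uses a small hand-picked set of stop words instead of loading NLTK’s list, and the regular expression is just one simple way to split text into tokens; neither is necessarily what my script does.

```python
import re

# A few words from NLTK's English list, hard-coded for illustration only.
STOP_WORDS = {"the", "and", "or", "but", "a", "an", "of", "to", "in", "with", "for", "is"}


def tokenize(text, stop_words=STOP_WORDS):
    """Lowercase, split into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [token for token in tokens if token not in stop_words]


tokens = tokenize("Experience with the development of software in a team")
print(tokens)
```

Here only ‘experience’, ‘development’, ‘software’ and ‘team’ survive, which is the kind of noise reduction we are after.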

###NLTK Stop Words###

I’ve reproduced the list here so you can get an idea of what will be removed. Again, please check out the NLTK website for more information.

i           their       doing    above    each
me          theirs      a        below    few
my          themselves  an       to       more
myself      what        the      from     most
we          which       and      up       other
our         who         but      down     some
ours        whom        if       in       such
ourselves   this        or       out      no
you         that        because  on       nor
your        these       as       off      not
yours       those       until    over     only
yourself    am          while    under    own
yourselves  is          of       again    same
he          are         at       further  so
him         was         by       then     than
his         were        for      once     too
himself     be          with     here     very
she         been        about    there    s
her         being       against  when     t
hers        have        between  where    can
herself     has         into     why      will
it          had         through  how      just
its         having      during   all      don
itself      do          before   any      should
they        does        after    both     now
them        did

##Unigrams - Stop Words Removed##

So I removed all tokens that were in the stop word list inside my tokenizer function.

Ah, now that’s better. Removing those stop words really seems to help. I did not expect for ‘work’ to be the most common token in the job descriptions. ‘Experience’ and ‘development’ were not surprises.

##Bigrams##

That covers what I talked about in the first article. Now I’m going to address the topics covered in the second article, bigrams and trigrams.

Without further ado, here are the most common bigrams:

Hmm, same problem as before. While some of the bigrams may be interesting (‘experience with’ or ‘knowledge of’), I think there are some more interesting bigrams that are hidden in the noise.

The bigrams that don’t contain stop words seem to be much more interesting. Seems like stop words are working.

##Trigrams##

Previously, trigrams were not that interesting in the titles. I hypothesized that this was because the titles are so short, usually two to three words (I should run some stats on that…). Descriptions are often much longer so maybe they will be more interesting.

Now let’s try removing stop words…

There’s some weird stuff near the top of the range in the trigrams. The people who run JobMine (CECS) put a notice in the job descriptions of positions outside of Canada. It’s good to let people know about the possible issues with working internationally but it seems to mess with the data. We are really interested in what employers write in their job descriptions. So I’ll try to remove them.

##Trigrams without Warning##

So I modified my script to remove the CECS warning from the job descriptions and plotted both charts. (I really should have written my script to produce a chart at the end instead of sending the results to Excel.)

Now let’s try removing stop words…

I’ve censored one of the trigrams as it would have given away one of the companies on JobMine and I don’t want to get in trouble with the university. In the future I’m going to try and play with public data sets so I don’t have this issue.

##Quadgrams##

Trigrams seem to be interesting, how about setting n to 4 (quadgrams?). Quick changes in the script and away we go…

Now let’s try removing stop words…

As you can see, the larger the value of n, the less frequent the top grams are. This makes sense as the chance that multiple employers will choose the exact same sequence of n words in a job description decreases with higher n. This is kind of like the randomly typing monkeys problem. However, I wouldn’t want to compare the employers to randomly typing monkeys…

##Next Time##

Looking at the descriptions seemed interesting to me but I’m definitely not done messing with this dataset yet. Some other things I’m thinking of doing next include lengths of job descriptions, writing level (though that could be sticky) and maybe some other stats that step away from the text itself.

Hope you enjoyed this article, shoot me an email if you have something to say.

Mining Jobmine - Fall 2010 - Part 2

2011-02-22T00:00:00-08:00

This is the second part in a multi-part series, read the first part here.

JobMine is the tool we use at the University of Waterloo for our co-op. In this article, I show some more insightful data based on n-grams.

##n-grams##

n-grams are sequences of n consecutive words. They are quite simple but can provide a lot of insight into a collection of text. Let’s take tweets from twitter as an example. If a tweet was ‘I don’t like customer service’ then the tokenizer would produce a list ‘I, don’t, like, customer, service’. The classifier would then ignore the fact that don’t and like are connected and only have the correct meaning when they are considered together. The classifier may see the ‘like’ token and classify the tweet as positive when clearly it is negative. In order to rectify this, the features (and subsequently the classifier) must take into account the positions of words relative to each other. A simple way to do this is with n-grams.
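Generating n-grams from a list of tokens only takes a few lines. The `ngrams` helper below is a minimal sketch written for this example, applied to the tweet from the paragraph above.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["I", "don't", "like", "customer", "service"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
```

With n set to 2, the pair (“don't”, “like”) becomes a single feature, so a classifier no longer sees ‘like’ on its own.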

Most Popular Words in Titles

In the first article I used 1-grams or uni-grams from the title.

The chart is reproduced here:

Now that is all very interesting but it does not provide me with enough detail. I know most jobs in the categories I’m looking for involve software, but how many are listed as “software tester” and how many are “software developer”?

##Bigrams##

So I ran through the titles and collected the top 25 bigrams…

As you can see, the highest frequency n-gram is “software developer”. Bigrams can obviously provide a lot more information than single words. I also find the distribution very interesting, it seems similar to that of the earlier article.

If I had more data, I’d really like to show the charts for the different job levels (In JobMine there are Junior, Intermediate and Senior classes on jobs).
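Collecting the most common bigrams from a set of titles could be sketched like this. The titles are made up for the example (they are not JobMine data), and splitting on whitespace is a simplification of a real tokenizer.

```python
from collections import Counter

# Made-up titles for illustration; these are not JobMine data.
titles = [
    "Software Developer",
    "Junior Software Developer",
    "Software Tester",
    "Web Software Developer",
]


def title_bigrams(title):
    """Pairs of consecutive lowercase words in a title."""
    tokens = title.lower().split()
    return zip(tokens, tokens[1:])


counts = Counter(bigram for title in titles for bigram in title_bigrams(title))
top = counts.most_common(1)
print(top)
```

Swapping `most_common(1)` for `most_common(25)` gives a top 25 list like the one charted above.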

##Trigrams##

Bigrams were definitely more descriptive, so how about trigrams?

Hmm, not really that interesting. Since titles are short, the number of trigrams that reoccur in titles is small.

##Next Time##

So we’ve looked at the titles, but what about the job descriptions themselves?

The next article will go into the meat of the job descriptions. Read it here.

Mining Jobmine - Fall 2010 - Part 1

2011-02-03T00:00:00-08:00

JobMine is the tool we use at the University of Waterloo for our co-op. In the system, there are thousands of job postings for various positions. Having access to a big database always gets my brain churning with possible alternative uses for the data contained within.

First, I should credit Lisa Zhang with the inspiration for this idea (and the title), she did some really cool stuff with JobMine and you should check her blog out.

Most Popular Words in Titles

The first thing we (students) look at when browsing JobMine are the job titles. So naturally the titles affect which jobs we click through to read and then apply to. The job titles have a lot of weight on our application process.

I should note that I only analyzed the jobs in my field (jobs in the Computer Engineering or Software Engineering categories). I also had some additional filters on so this is not exactly a perfect sample, but I think it’s interesting nonetheless.

So I went through the jobs, grabbed the title and ran a custom tokenizer to extract the grams. Here is a chart of the most popular words in job titles.

And here are the top 100:

Discussion

So this is interesting but not terribly surprising. It’s logical that the top word is “software” and the next “developer”. Something I found interesting was the choice of “engineer” versus “engineering”. Engineering occurs more frequently than engineer by a factor of almost 20. Technically in Ontario, having a job title of something engineer is against PEO and you could be prosecuted for it (according to my Engineering Ethics and Law class). Perhaps this is the reason for the difference. On a side note, “Software Engineering” has always looked awkward on a resume to me.

I was surprised by the low number of QA jobs, but then again, I think I had the filter to remove junior jobs and they tend to be more QA oriented.

There are definitely some interesting things that can be taken from this analysis but there is still much more to look at. My next article continues the analysis of titles with different n-gram sizes.

New Site - Hello Hyde!

2010-11-21T00:00:00-08:00

I’ve switched to a new method of creating this site, now I’m using Hyde…

##Before##

I used to write all my site content in a giant PHP file with if statements for which page to show.

I admit, it was ugly.

First off, everything was in one file and was a mess. The real reason for using PHP was so I wouldn’t have to replicate all the boilerplate HTML on every page. I didn’t like the fact that I was using PHP just to switch pages, but hey it worked.

##Problems##

I didn’t want to go with a full blown blogging/cms framework like Drupal or Wordpress. I just wanted a dead simple way of showing people what cool projects I was working on.

I used to blog, 6 years ago. But I don’t want to blog. A blogger should produce quality content regularly. I want to build things and create cool things. Blogging just uses up my time.

But I still sometimes have something to say in a blog-like format. And this is what you are looking at. The site has a static focus. Things don’t update all the time, but content does change. That’s one of the reasons this section is ‘Articles’ and not ‘Blog’.

##Make a wish##

There had to be a better way than an ugly PHP file. So I did what most people do when they wish for a piece of software: I googled it.

And magically someone had built it. It was called Jekyll.

Jekyll is awesome. But it’s written in Ruby. And I like python so I made a wish to google again and found my answer…

##Enter Hyde##

And that answer was Hyde. It was essentially Jekyll but in Python! Perfect name.

So instead of trying to rehash why Hyde is awesome I’ll point you to Steve Losh’s post.