Tugberk Ugurlu @ the Heart of Software

My Mental Model on Choosing a Database for a Particular Problem

Tugberk Ugurlu — Thu, 08 Apr 2021 02:37:00 +0000

Choosing a database technology for the problem at hand is a multifaceted problem. There is usually no single correct choice, but there are surely multiple wrong choices. Depending on the problem you are trying to solve within your unique set of constraints, getting this choice wrong could have significant negative implications on your users and business.

Over the years, I have developed a mental model for judging whether a certain database technology is the correct one for the problem at hand. This has worked well for me when designing systems as well as when peer reviewing other technical designs. I wanted to share this here in case it helps others, while also serving my selfish need to document this somewhere.

I have put this together quickly. So, some ideas may not be as solidified as they could have been. Please drop a comment if you have any questions or suggestions.

Dimensions to Think About

There are various dimensions that I consider in order to make an informed decision about whether a certain database technology fits the problem which I am trying to solve. This set of dimensions is likely to be unique to the environment and constraints you are working within. However, I have observed some common ones which always came up no matter what type of environment I was making the choice in. I will list them here and explain what I mean by each. However, just keep in mind that this is by no means an exhaustive list. So, if you can think of any others based on your own experience, I would like to hear and learn about them 🙂

Access and Write Patterns

Before you make a choice about a database technology, it's immensely valuable to think about your access and write patterns: how the data is written and how you want to query that data. This increases your chances of making a correct choice, and it is likely the biggest deciding factor that will let you narrow down your choices, as database technologies often differentiate themselves around access patterns and how the writes are processed.

Write pattern is usually the easy part since the writes usually come in a defined shape, and are unlikely to change drastically throughout the lifetime of the application. Therefore, writes usually dictate in what shape you want to keep the source of truth. I will talk more about the write pattern within the scalability section below, which is where the real challenge with writes comes into the picture. However, there are still a few other considerations you need to keep in mind with writes. For example, what happens if you receive concurrent writes to the same piece of data? You need to have a strategy for dealing with this situation through techniques like optimistic concurrency control, and it's critical to know how your data storage technology can help here. The other important but often overlooked aspect is idempotent writes. If you want to ensure that a write takes effect exactly once even when the client retries it, you need a technique to detect and discard the duplicates. For example, the ClientRequestToken parameter of DynamoDB's transactional writes acts as an idempotency token, and is a great example of out of the box support for this type of problem.
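To make the two write-side techniques above a bit more concrete, here is a minimal in-memory sketch in Python. It is not how any particular database implements them: the `InMemoryStore` class, the `expected_version` argument and the `token` argument are all hypothetical names I made up for illustration. The idea is simply that a write carries the version the writer last saw (optimistic concurrency control) and a client-generated token (idempotency).

```python
from dataclasses import dataclass, field


class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


@dataclass
class InMemoryStore:
    # key -> (version, value); a hypothetical stand-in for a real table
    rows: dict = field(default_factory=dict)
    # idempotency tokens already applied -> version produced by that write
    seen_tokens: dict = field(default_factory=dict)

    def read(self, key):
        """Return (version, value); version 0 means the key does not exist yet."""
        return self.rows.get(key, (0, None))

    def write(self, key, value, expected_version, token):
        # Idempotency: a retried request with the same token is not applied again.
        if token in self.seen_tokens:
            return self.seen_tokens[token]
        current_version, _ = self.read(key)
        # Optimistic concurrency: reject the write if our snapshot is stale.
        if current_version != expected_version:
            raise VersionConflict(
                f"expected v{expected_version}, found v{current_version}"
            )
        new_version = current_version + 1
        self.rows[key] = (new_version, value)
        self.seen_tokens[token] = new_version
        return new_version


store = InMemoryStore()
v1 = store.write("order-1", {"status": "placed"}, expected_version=0, token="req-a")
# A network retry of the same request does not apply the write twice.
v1_retry = store.write("order-1", {"status": "placed"}, expected_version=0, token="req-a")

# A second writer that still believes the row is at version 0 is rejected.
try:
    store.write("order-1", {"status": "cancelled"}, expected_version=0, token="req-b")
    conflict = False
except VersionConflict:
    conflict = True
```

In a real system the version check and the write need to happen atomically inside the database (e.g. through a conditional write), which is exactly the kind of support worth checking for when evaluating a storage technology.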

As for the access pattern, it is all about how you want to query the data. The reason why this matters, and differs from the write pattern, is that we almost always need to query the data in a different shape compared to how it's written, and more importantly the needs of your read pattern often change (while you usually also need multiple read patterns per data source). Why data access patterns evolve at a much greater velocity than the writes is too complicated a topic to get into here. I will leave CQRS here as a topic for you to explore.

Take the example of a coffee shop which receives orders from customers, and a loyalty program which needs to give customers points based on their orders and the value of those orders. Here, writes are simple: an "order entity" is written into a data storage system and grouped within the orders "bucket" for each customer order. However, the read here is more complex. We need to be able to calculate a loyalty point score for each customer based on their previous orders, and we also need a way to perform this efficiently (i.e. we don't want to scan the entire "bucket" for this).
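One way to picture the coffee shop example is to keep a small read model next to the orders, updated on every write, so that the loyalty score never requires a scan. The sketch below is purely illustrative: the `CoffeeShop` class and the "one point per whole currency unit" rule in `points_for` are assumptions of mine, not something a specific database gives you.

```python
from collections import defaultdict


def points_for(total):
    """Hypothetical rule: one point per whole currency unit spent."""
    return int(total)


class CoffeeShop:
    def __init__(self):
        self.orders = []                # source of truth: every order as written
        self.points = defaultdict(int)  # read model: customer_id -> running score

    def place_order(self, customer_id, total):
        self.orders.append({"customer_id": customer_id, "total": total})
        # Keep the read model in step with the write.
        self.points[customer_id] += points_for(total)

    def loyalty_points(self, customer_id):
        """Constant-time lookup against the read model."""
        return self.points.get(customer_id, 0)

    def loyalty_points_by_scan(self, customer_id):
        """What we want to avoid: scanning the whole orders bucket."""
        return sum(
            points_for(order["total"])
            for order in self.orders
            if order["customer_id"] == customer_id
        )


shop = CoffeeShop()
shop.place_order("alice", 3.50)
shop.place_order("bob", 12.00)
shop.place_order("alice", 4.20)
```

Both methods return the same score, but only `loyalty_points` stays cheap as the orders bucket grows. Many databases can maintain something equivalent for you (an index, a materialized view, an aggregation), which is the point of the paragraph that follows.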

This was a fairly easy example, and I am sure you can see that a lot of data storage systems already have a solution for this problem. So, it may not make much of a difference if your needs are this simple. However, as your access pattern needs get more complex, this aspect is going to put more and more weight on your decision making. Take the example of a data access need where you need to find conflicting events within a given location. The write pattern here is likely to be simple, but the access pattern is a unique one. Not many data storage systems can index for this type of query. So, you need a specific data storage tech to be able to pull this off, or you need to get creative with how you store the data, with the potential cost of lower throughput and higher latency on writes.

Another example here is when you need to perform a full-text search over a piece of text across all the rows. Depending on how you want to perform your search (e.g. contains, starts-with, exact word match, etc.), how you want to normalize the search experience (e.g. case-sensitivity, ignoring some stop words such as "and", "or", etc.) and which languages you want to support, this can get pretty complex quite easily. Therefore, the requirement here needs to be understood before you can make a choice about the data storage technology.

Consistency

The consistency of the data is another pretty important aspect to worry about. When you write data into a database and get a successful ACK back, different databases set different expectations about what you will see if you read that data again immediately. For example, your data might be written and stored in a durable way. However, you may not see the data you have written immediately, but you are given a promise that you will see it eventually. This characteristic is commonly referred to as eventual consistency, and it can have a big negative impact on the system we are designing unless we understand the implications beforehand. There are several reasons why a data storage technology might provide eventual consistency. However, the fundamental trade-off here is that you gain higher availability and/or throughput at the cost of consistency.

In the case where you are working against a single node data storage system, the reason you might be facing eventual consistency could be that the indexes are updated asynchronously, meaning that at the time you receive an ACK from the database about your write, the index which your query is hitting may not have been updated yet. This is done to increase throughput, as waiting on an index update takes time and decreases the number of writes you can handle per second.

However, in the case where you are working against a multi-node database structure, where you have replicas, the reason why you might be facing eventual consistency is likely to be that the data hasn't yet been replicated to the replica node which is being used for the query. In this case, you are actually gaining three benefits:

further availability (by having the replica which you can failover to),
higher read throughput (by having the replica which you can use to read data from),
and higher write throughput (by not replicating the data to each replica synchronously).

Each database is designed to work differently with this structure, and it's common to have systems where the replication is performed asynchronously by default to increase write throughput at the cost of eventual consistency. Understanding how each database works and identifying your needs are critical here. In many database systems, there is a way to influence the default replication behavior. For instance, Redis provides a WAIT command which can be used to block the current client until all the previous write commands are successfully transferred and acknowledged by at least the specified number of replicas. While this does not make Redis a strongly consistent store, it significantly increases the chances of data availability, and increases the chances of consistency as well. Note that this comes with a few costs:

The writes will take more time to be executed due to the nature of synchronous replication.
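You are performing this at the cost of availability ('A' from the CAP theorem): when not enough replicas can be reached, for example during a network partition, the client has to block or give up on the guarantee instead of getting a quick acknowledgement.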

MongoDB also offers a similar configuration which is called write concern, and it can be set per write.
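The replication trade-off discussed in this section can be sketched with a toy simulation. To be clear about the assumptions: `Primary`, `Replica` and the `wait_for` argument are hypothetical names, replication is triggered by hand rather than by a background process, and folding the wait into the write call is a simplification (Redis issues WAIT as a separate command after the writes, and MongoDB expresses this through write concern).

```python
class Replica:
    def __init__(self):
        self.data = {}


class Primary:
    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas
        # One backlog of not-yet-replicated writes per replica.
        self.backlog = [[] for _ in replicas]

    def write(self, key, value, wait_for=0):
        """Apply the write locally, then wait for `wait_for` replicas to catch up."""
        self.data[key] = value
        for pending in self.backlog:
            pending.append((key, value))
        # Synchronously replicate to the first `wait_for` replicas before acking.
        for index in range(min(wait_for, len(self.replicas))):
            self.replicate(index)

    def replicate(self, index):
        """Drain the backlog for one replica (normally done asynchronously)."""
        for key, value in self.backlog[index]:
            self.replicas[index].data[key] = value
        self.backlog[index].clear()


r1, r2 = Replica(), Replica()
primary = Primary([r1, r2])

# Asynchronous replication: the write is acked, but a replica read is stale.
primary.write("a", 1)
stale_read = r1.data.get("a")

# Waiting for one replica: r1 is caught up when the write returns, r2 is not.
primary.write("b", 2, wait_for=1)
r2_before_catch_up = dict(r2.data)

# Eventually the lagging replica catches up as well.
primary.replicate(1)
```

The write with `wait_for=1` does more work before it returns, which is the latency cost mentioned above, and it cannot complete its guarantee at all if that replica is unreachable.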

There is no simple answer here to indicate one is better than the other. It all depends on your needs, and what type of trade-offs you are happy to make.

Fault Tolerance

What I mean by fault tolerance here is whether the database can gracefully handle one or multiple nodes going down. When a replica is down, this is usually not a big issue within the setup. A new replica can be spun up, and can catch up with the others through the log of the data storage system (which can be unique to each database technology). However, when the primary node, which handles the writes, is down, it's a bit more serious as there is only one node that handles the writes. In this case, what usually happens is that a new primary is elected from the available replicas through the consensus protocol employed by the database technology. This is what we call failover.

As nearly every data storage system comes with a replication setup and a failover approach, this is less of an issue today, but can still be a deciding factor. The differences usually come down to the following points:

What type of protocol the database system employs to handle consensus and failover. Raft and Paxos are two examples. You often don't care about the actual protocol used, but you care about the implications, such as how common issues like split-brain are handled, how fast the failover detection can happen and whether you have any configuration point to influence this, etc.
How the clients handle the failover, e.g. how they perform DNS caching, what the default timeout is to detect a node which is down, etc.

Scalability

Scalability is one of the most critical aspects, and can be a differentiator for a database choice. What I mean by scalability here covers both reads and writes.

Read Scalability

I believe read scalability is largely a solved problem across databases, and easy for users to reason about, with some trade-offs such as eventual consistency. So, read scaling is not likely going to be the differentiator. That said, there could be some constraints put by a particular database around read scaling which might be enough to put you off. For example, AWS PostgreSQL Aurora has a hard limit of 15 read replicas per cluster. That usually doesn't become an issue but I can see how it can be a deciding factor.

Write Scalability

On the other hand, things get really interesting when it comes to write scaling. Scaling writes has always been much more challenging than scaling reads, and some of the reasons for this are the shape of the data we have historically stored (i.e. relational data), how we relied on some database features which made this super hard (e.g. database-generated sequential IDs), and how much you as the user of a particular database technology need to do to scale your writes. However, with the rise of NoSQL databases, write scaling became much easier. This was not because they came with a magical, breakthrough technique or anything. These databases just restricted how we were allowed to write the data in the first place, and educated every user about eventual consistency.

If we turn back to today's world, the common way to solve the problem of write scaling is through sharding, which is the process of horizontally partitioning the data across multiple nodes where each of these nodes holds only a subset of the data. How to determine which node holds which data is likely to come down to the user's choice, based on the chosen sharding strategy. However, with simple systems the default is likely to distribute the data based on the following calculation: generating the hash of the identifier of the data, and taking it modulo the number of shards. Which hashing algorithm to use here is also database technology specific. For example, Redis uses CRC16.
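The hash-then-modulo calculation described above can be sketched in a few lines of Python. The `shard_for` function and the key format are my own illustrative choices. For the hash I am using `binascii.crc_hqx` from the standard library with an initial value of 0, which computes the CRC-16/XMODEM checksum; note that Redis Cluster applies its CRC16 to a fixed set of 16384 hash slots rather than directly to the number of shards, so treat this as the simple scheme and not as what Redis does.

```python
import binascii


def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard by hashing the identifier and taking it modulo the shard count."""
    checksum = binascii.crc_hqx(key.encode("utf-8"), 0)  # 16-bit CRC of the key
    return checksum % num_shards


NUM_SHARDS = 5
placement = {key: shard_for(key, NUM_SHARDS) for key in ("customer:1", "customer:2", "customer:3")}
```

Every client that knows the shard count can compute the same placement without any coordination, which is what makes the scheme attractive, and also what makes changing the shard count painful.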

If you identify that you need to perform data partitioning as one node won't be enough to handle the write load, you will be narrowing down your choice significantly as some databases will not give you this functionality out of the box.

However, the real issue starts when it comes to identifying the details of how each database technology makes the sharding work. One interesting aspect here is the resharding part. When you reach a certain scale with your initial setup (e.g. 5 shards, each having 3 replicas), you might need to scale further. In that scenario, you will be looking into options to add a new shard node to your setup. This will mean that you need to perform an action called resharding, meaning that your data set will be redistributed across nodes, and at least some of the data, if not all, within the existing shards will move to the newly created shard. This creates a problem with the sharding approach I described above, as the clients know that there are only 5 shards, and which node the data needs to be directed to is calculated based on this value. Adding a new shard will start changing things drastically, and will likely introduce downtime to our system unless we are happy to serve the wrong outcome to our users (which we almost never are). More importantly, the more data we have across all the nodes, the more downtime we will have, which is not a very good scenario to be in.
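To get a feel for how disruptive this is with the hash-modulo scheme, the following sketch counts how many keys would land on a different shard when going from 5 to 6 shards. The key names and the `shard_for` helper are hypothetical; the hash is again `binascii.crc_hqx` from the Python standard library.

```python
import binascii


def shard_for(key: str, num_shards: int) -> int:
    """Hash-modulo placement, as in the simple scheme described earlier."""
    return binascii.crc_hqx(key.encode("utf-8"), 0) % num_shards


keys = [f"order:{i}" for i in range(10_000)]

# A key "moves" if its shard differs between the 5-shard and 6-shard layouts.
moved = sum(1 for key in keys if shard_for(key, 5) != shard_for(key, 6))
moved_fraction = moved / len(keys)
```

For a hash that spreads keys uniformly, a key keeps its shard only when its hash gives the same remainder for 5 and for 6, which happens for about one key in six. In other words, adding a single shard relocates the large majority of the data, not just the share the new shard is supposed to take over.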

This is where more clever distribution techniques are employed to allow for zero-downtime resharding to be performed. For example, Redis uses a slot based key distribution model which works really well under resharding scenarios. On the other hand, Cassandra uses a technique called consistent hashing which allows distribution of data across a cluster to minimize reorganization when nodes are added or removed. I don't have much experience with a structure like this but it seems like writes continue to be accepted successfully during the resharding phase.
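For comparison, here is a minimal consistent hashing sketch. It is a generic textbook-style ring with virtual nodes, not Cassandra's actual implementation; the `HashRing` class, the node names, the choice of SHA-256 for ring positions and the number of virtual nodes are all assumptions made for illustration.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=100):
        self.ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add_node(node, vnodes)

    def add_node(self, node, vnodes=100):
        # Each node owns many small arcs of the ring to even out the load.
        for i in range(vnodes):
            bisect.insort(self.ring, (ring_position(f"{node}#{i}"), node))

    def node_for(self, key):
        # The owner is the first node position at or after the key, wrapping around.
        index = bisect.bisect_right(self.ring, (ring_position(key), ""))
        return self.ring[index % len(self.ring)][1]


keys = [f"order:{i}" for i in range(10_000)]
ring = HashRing([f"shard-{n}" for n in range(5)])
before = {key: ring.node_for(key) for key in keys}

ring.add_node("shard-5")
after = {key: ring.node_for(key) for key in keys}

moved_keys = [key for key in keys if before[key] != after[key]]
moved_fraction = len(moved_keys) / len(keys)
```

With this structure, the only keys that change owner are the ones the new node takes over, so the moved fraction stays close to the new node's fair share instead of most of the data set, and the existing shards never exchange data among themselves.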

Elastic Scalability

Also, you should not just be thinking about today's load when it comes to scalability, but also project the next 6, 12, 24, 36 months. This will let you see whether your current solution is likely to meet the demand of the future or likely to require a redesign. A good example of why this is important is when you decide to work with a database from a cloud provider, for instance, PostgreSQL Aurora from AWS. Let's imagine that based on the demand on your read-heavy system today, you can get away with having 5 of the largest Aurora instances, where you have one master and 4 replicas. As Aurora doesn't support data partitioning at the time of this writing, your only choice is to increase the number of replicas to be able to scale your reads further. This sounds great, as our system is read-heavy (for now, let's assume writes are not an issue, e.g. we have a way to throttle them, etc.), and we can keep adding replicas. However, if we were to project our load against the expected company growth, we may find out that in 12 months, we will need 20 replicas instead of 4. That's a problem as Aurora doesn't allow more than 15 read replicas per cluster. So, this will likely require a redesign of your system one way or another, and most importantly, you may only find out about this limit when you hit the ceiling, which could be too late to save the day.
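The projection in this example is simple enough to put into a few lines. The sketch below assumes, as a deliberate simplification, that the number of replicas needed grows in proportion to load and that load grows by a constant factor per year; `replicas_needed`, `months_until_limit` and the 36 month horizon are hypothetical names and values. The numbers plugged in are the ones from the example: 4 replicas today, 20 in 12 months (a 5x yearly factor), and a ceiling of 15.

```python
import math

REPLICA_CEILING = 15  # the per-cluster ceiling from the example above


def replicas_needed(current_replicas, yearly_growth_factor, months):
    """Project replica count, assuming replicas scale linearly with load."""
    projected = current_replicas * yearly_growth_factor ** (months / 12)
    return math.ceil(projected)


def months_until_limit(current_replicas, yearly_growth_factor,
                       ceiling=REPLICA_CEILING, horizon=36):
    """First month in which the projection exceeds the ceiling, or None."""
    for month in range(horizon + 1):
        if replicas_needed(current_replicas, yearly_growth_factor, month) > ceiling:
            return month
    return None


projection = {m: replicas_needed(4, 5.0, m) for m in (6, 12, 24, 36)}
first_month_over_limit = months_until_limit(4, 5.0)
```

Even a rough model like this is useful because it turns "we will hit the limit at some point" into a month you can plan around, well before the 12 month mark in this example.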

Sometimes though, it’s acceptable to commit to a redesign if we were to need a larger scale, for several reasons, e.g. development velocity. However, this needs to be an explicit call and your team needs to have a shared understanding on this.

Maintenance and Observability

Regular backups and database engine upgrades are a big part of your maintenance. Backups are usually handled pretty well with nearly all databases out there today. However, version upgrades can be a pain, and often the issues associated with them are not detected till the first upgrade needs to happen. Finding out the procedure for these tasks before you actually need them is critical, and can be a deciding factor. For instance, if you have no choice but to introduce downtime during version upgrades, it could mean a significant impact to your business. Therefore, that risk might need to be assessed prior to your choice.

Observability is another part which you need to be absolutely on top of from day one. You should understand what's happening within your database servers, and have your monitors lined up to detect potential issues, ideally proactively. For example, you need to be able to reason about why your database query latency has started to increase for certain queries. Is this happening due to hitting the IOPS limit? Are we saturating other resources of the server such as CPU due to intense indexing caused by elevated writes? Without proper observability, you will be in hell trying to figure these out. Luckily, many data storage systems offer near-perfect metrics to gain as good observability as possible.

Related to the point above, it's also worth thinking about the operational aspect. For example, when you are on-call, and you are getting paged about your data storage system, you want to have some actions that you can take with minimum effort and minimum negative implications. This may not be a problem for you today, but it will surely become a problem at some point (remember Murphy's law: Anything that can go wrong will go wrong!).

Self-healing

As mentioned above, anything that can go wrong will go wrong. Paging the on-call engineer to fix an issue related to a database is not the ideal scenario. If you can make your system work in a self-healing mode, you should absolutely invest in doing so. What I mean by self-healing is scaling up with elevated demand to prevent potential resource saturation issues, healing from failures by performing failovers, resharding when writes start to become a problem, and doing all of this without needing any intervention.

Some systems offer some of these self-healing operations out of the box. For example, with AWS Aurora, you can configure replica auto-scaling based on several metrics (e.g. average CPU utilization of all replicas, etc.). With DynamoDB, resharding is performed for you without you knowing anything about it, which is pretty epic.

Judging the self-healing aspect might bring a pretty big weight for a certain database choice, and could be a differentiating factor. Also, investing your effort into investigating this area will pay off quite quickly, and you will thank yourself when you go on-call 🙂

Cost

This can be an important distinction especially when you are working with a cloud provider. Depending on your needs, sometimes the cost will make you shy away from a perfect data storage system. For example, consider a write-heavy workload with a predictable, steady amount of writes for the foreseeable future. Considering this requirement, DynamoDB sounds like a near-perfect choice if you are heavily using AWS. However, DynamoDB charges are heavily influenced by write capacity units (and a few other aspects), which are based on the number of writes and the size of the data you are writing. The best thing with DynamoDB is the fact that it scales to your needs elastically, but in this case, we already know our load and it's steady. Therefore, a PostgreSQL Aurora instance with a known capacity to handle the predictable load will likely perform much better in terms of cost, as the price is set based on the instance size and the amount of data you will be storing.
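A back-of-the-envelope comparison like the one above can be sketched as follows. Everything here is a placeholder: the function names are hypothetical, and the two prices are made-up numbers chosen only to show the arithmetic, not real DynamoDB or Aurora prices. The model also ignores storage, reads, item size and any capacity discounts, so treat it as a starting point for your own calculation with the provider's actual price list.

```python
import math

SECONDS_PER_MONTH = 30 * 24 * 3600


def per_write_monthly_cost(writes_per_second, price_per_million_writes):
    """Monthly cost when you pay for each write."""
    writes_per_month = writes_per_second * SECONDS_PER_MONTH
    return writes_per_month / 1_000_000 * price_per_million_writes


def break_even_writes_per_second(fixed_monthly_cost, price_per_million_writes):
    """Steady write rate above which the fixed-price instance is cheaper."""
    return fixed_monthly_cost * 1_000_000 / (price_per_million_writes * SECONDS_PER_MONTH)


# Placeholder prices for illustration only.
PRICE_PER_MILLION_WRITES = 1.0
FIXED_INSTANCE_MONTHLY_COST = 500.0

pay_per_write = per_write_monthly_cost(1000, PRICE_PER_MILLION_WRITES)
break_even = break_even_writes_per_second(FIXED_INSTANCE_MONTHLY_COST, PRICE_PER_MILLION_WRITES)
```

With a steady, known write rate you can read the decision straight off the break-even point, provided the fixed-price instance can actually sustain that rate.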

I would suggest paying attention to this area, especially if your scale requires it and you have alternative choices. If you happen to narrow down your choice to a single storage system, you might still need to calculate the expected cost, but just for FYI purposes rather than as an input to your decision making.

Regulations and Compliance

I am no expert in this area, but I have managed to find my way around it so far. Depending on the regulatory needs which your organization has to comply with, your database choice may get impacted. This is especially the case when you are going to be storing personally identifiable information (PII). There are several ways in which your data storage system might be affected by these compliance concerns:

Nearly all regulations I have had to deal with so far required encrypting PII data at rest, meaning that you should store the data on disk as encrypted. Luckily, many databases offer this support out of the box, e.g. MongoDB, DynamoDB.
Transport level encryption is also something you need to have when connecting to the database from the clients. This is also something which many databases have support for. So, it is unlikely to be an issue.
One interesting part with regulations such as GDPR is that they give users the right to be forgotten, meaning that you should have a way to delete relevant user data from your system upon request from the user, within one month. For many databases, this is not much of an issue, but with some architectural patterns such as event sourcing, and with the databases which support them such as EventStoreDB, you may need to plan ahead how you can do this, and what the implications would be.

Small Details That Could be Deal Breakers

Besides these, there are some small details which could turn out to be deal breakers depending on your circumstances.

One of them is database client support for a particular programming language. You found the near-perfect database for your needs. However, it turns out that there is no client for the programming language of your choice. Unless you are happy to stop what you are doing and write the client by yourself, this could be a deal breaker. However, there could be some steps you can take to overcome this deficit depending on how many alternative choices you have:

You might be able to change the programming language especially if you are considering this database to be used within a fresh codebase. This may or may not be applicable to you depending on your appetite and constraints you are working within.
In case you are thinking about using this database within an existing codebase, you can still take the approach described above but encapsulate the data access logic within a separate service that can be written in a different programming language, and exposed through common transport protocols (e.g. direct use of HTTP/2, or gRPC, etc.).
In case none of these are options and the database seems to have a C/C++ client, you might be able to wrap this client within the programming language of your choice, if there is support for this (e.g. Go has this through cgo). This still requires extra initial effort, and potential maintenance effort. So, you should evaluate the trade-offs of going down this road vs. choosing an alternative.

The other one is around licensing. This is usually a concern when you pick an open source database technology, and that technology turns out not to be released under a permissive license. There could be differences between each license, but this might concern you especially if you are distributing the software to your users directly. It's worth double checking the license and getting a clearance before moving forward with a choice.

Practicality of This Mental Model

This mental model has worked for me many times, and most of the choices I have made turned out to be sustainable options. However, you will see that this or similar approaches usually help the most with narrowing down your choices. Therefore, it's likely that you will end up with multiple options that you can work with, and sometimes the trade-offs will also be balanced, and it might become harder to make a call. When that happens, the best tool at your disposal is going to be benchmarking. This approach will likely get you the most accurate quantitative data you can have in hand to compare choices, and allow you to gain a more solid perspective.

Also, when we get into the hunt of choosing a database, we may end up forgetting that the best database is sometimes no database. This is especially the case if you are working with ephemeral data which is easily repopulated, and can fit into memory. I am not suggesting that this is something you should go with, but it's worth considering for sure 🙂

Whiteboard-style Coding Interviews Might Not Be as Bad as You Think

Tugberk Ugurlu — Sat, 03 Apr 2021 22:07:00 +0000

Whiteboard-style coding interviews have a bad reputation, and they are perceived as a "bad interview practice" in general within the software industry (e.g. "Why whiteboard interviews are BROKEN", "Coding Interviews are Broken", and there are many others like these). Despite this perception, many companies (especially the top-tier tech companies, a.k.a. FAANG and the likes) still hire software engineers through this interviewing process.

What I mean by "whiteboard-style coding interviews" is an assessment through algorithm and data structure based coding challenges which are performed within a coding environment that doesn't reflect the day-to-day experience of an engineer, e.g. whiteboard, rudimentary online coding editor, etc. This is also known as "leetcode-style", interviewing through CS trivia questions, etc. So, a physical whiteboard doesn't necessarily need to be present, but it's a name that's often associated with this type of interview.

As someone who has conducted 100+ tech interviews within the last 4 years, and also as someone who has failed this type of interview nearly a dozen times while being successful with a few of them in the last 10+ years, I would like to express my thoughts on why this perception might not be the most accurate one, why organizations still stick with this interviewing style despite the negative perception, and also why this type of interview can and should even be perceived positively by candidates.

I am a person who believes that there is no cookie-cutter approach to the problem of hiring software engineers effectively. Each organization should essentially do what's working best for them, and most importantly it should be encouraged to experiment with different approaches to find the near-perfect balance based on what the organization is trying to optimize for. Therefore, my intention with this post is not to try to reject the hypothesis of whiteboard-style interviews being "bad interview practices". That said, in general, I do believe that gaining a different perspective on this type of interviewing process will be helpful to all of us, mainly to understand the problem which it aims to solve. This is not just so that we can eventually accept the process as is, and move on with our lives. Rather, if we understand why the process exists today in the way it does, we will then be able to form a better judgment of the process from the perspective of both sides, and we can then challenge the assumptions, and influence a change if necessary.

⚠️ Disclaimer 1: the content of this blog post is based on my personal opinions, and by no means represent the view of my current or former employers. So, just keep that in mind while reading the post, and be nice 🙂

⚠️ Disclaimer 2: I acknowledge that interviewing is an area where quantifying the success of a certain interviewing process is really challenging. Also, I am aware that the information I will unleash here is not backed by any scientific evidence. However, it comes from experience. So, I still see it as valuable to share. Nevertheless, take all the info in this post with a grain of salt.

⚠️ Disclaimer 3: I am aware that there are organizations in many shapes and sizes. It would be naive to think that there would be one interviewing style that would work for all of them. In this post, I am mainly focusing on organizations which work on tech-centric products, where the tech side of the business is not there to just implement features and fix bugs, but instead it's at the center of your organization to drive product decisions and innovations (in these organizations, tech becomes the organization's competitive advantage in the market).

In some parts, I might be implicit about this fact. So, keep this in mind. When I am referring to an "organization", you should now know what I mean by that.

There are Various Hiring Strategies

Let's start by seeing the problem of hiring software engineers from the organization's viewpoint. I personally believe that defining what type of software engineer you want to hire into the organization is a multifaceted problem, and plays a crucial role in understanding the reasons behind the interview process of an organization. I want to touch on three of these dimensions to narrow down the focus of this post. I am super aware that this is not an exhaustive list. However, based on my experience, they play a significant role in shaping the interview process that software engineers go through. My hope here is that the information in this section will set foundational knowledge for us to understand why a certain interview process might be set up the way it is today, which will/should eventually make it clearer what problem "whiteboard"-style interviews are trying to solve (more on that later).

Today's Problems vs. Tomorrow's

The hiring criteria usually differ a lot depending on whether you are hiring Software Engineers to solve today's problems vs. tomorrow's. This indeed plays the biggest role in what type of interview process and what type of coding assessment Software Engineers may need to go through.

Today's Problems

Today's problems for an organization are well-known, and we are already aware of the challenges of those problems even if they may not have been solved yet. If you are hiring engineers to solve today's problems, meaning that you know what technologies, technical challenges, and architectural structure you will be working with in the long run based on today's view, you will have a pretty good idea of what you want from a Software Engineer.

For instance, if you as the organization are on k8s and you intend to stay on k8s for the rest of your existence as an organization, it's highly logical to assess the candidate's knowledge and experience of k8s as part of your interview process. Another example could be given here for a specific domain. Let's assume that the organization's products are centered around a search functionality. So, it could be acceptable to think that the organization can aim to hire Software Engineers who have prior experience in the search domain, and the assessment criteria could be centered around the search domain during the interview.

This type of interview process is relatively easy compared to the next topic we will discuss, because the organization has a high chance of being able to solidify the assessment criteria while also leaving less room for personal judgment. On the other hand, this type of hiring approach comes with a great deal of risk, and the reason is quite simple: tech-centric product design and development is highly volatile. It's not just that the problems which we have to deal with keep evolving, but also the technologies we use to solve these problems. These sometimes even change in directions which haven't been imagined yet. Therefore, if your assessment criteria during an interview process are centered around a specific area, regardless of this being a specific technology, domain, or problem solving technique, you are betting on the successful candidates to adapt to these changes and be able to solve the unique problems which come with those, while also being able to innovate. Or, maybe as an organization, you are more naive and thinking that things will stay still for the foreseeable future, and you will just be fine. However, I am hoping that no organization is betting on the latter.

Tomorrow's Problems

What I mean by tomorrow's problems is the notion of uncertainty from the perspective of user, business and technology centric challenges. It's critical to be able to deal with this uncertainty, potentially within an ambiguous environment (uncertainty often brings out ambiguity regardless of whether the organization is well-structured or not by default). Based on my experience, this is much closer to the reality within the current tech industry, where things are changing fast, and you are required to adapt and move fast.

As an organization that wants to hire Software Engineers who will be able to cope with the challenges of both today's and tomorrow's problems, you want to hire engineers that fit into the creative thinking process through their sharp problem solving skills with critical and analytical thinking. This may sound great, but assessing engineers against these criteria is much, much harder compared to the process to hire engineers for today's problems. There are a few reasons for this, again, based on my own experience:

The criteria described above don't reveal a specific domain or technology for you to pick for the assessment. This eventually guides the organization to boil the evaluation down to fundamental knowledge and problem solving skills (more on that later).
As there is no specific technology or domain here to assess, individual interviewer judgment might be more significant. This is especially the case when the interview process is set up correctly and effectively, meaning that the decision is not binary, i.e. it's not "solved the coding problem. Therefore, it's a pass!", or vice-versa.
It's a challenge to make this type of interview process work at scale, both for the organization and the candidates. This is especially the case for big tech companies, which have thousands of engineers working for them.

Mixed Hiring Strategies

There is also room for a hiring strategy which mixes both of the criteria mentioned above. At the end of the day, the majority of the work for any tech-centric organization is going to be around problem solving through a creative thinking process. However, there is probably still a small amount of work (I would like to unscientifically say around 10-15%) which needs to be completed in the short-medium term, and requires specific skills, or knowledge and experience of a specific technology. When you have this need, it's common to see organizations adopting a different hiring process to hire employees for the short term, i.e. as contractors (while also still hiring full time employees to solve tomorrow's problems). This is completely valid, and works well for both sides as long as this is kept to a minimum, and doesn't become your default and only hiring strategy. The reason is that:

The organization knows what they want in terms of hiring criteria, which happens to be a specific and easy to assess one.
Expectations are set correctly for both sides, i.e. short-term employees know they are hired to execute on the work based on their existing skills, and the organization knows that there are folks who are hired purely for execution.

Hiring Into a Team vs. Company

Depending on where you want to land a prospective employee after the successful outcome from the interview process, your hiring strategy can also differ, and can even be multiplied. If you are an organization where your teams have longevity, and work on specific domains, it's valid to hire into a specific team. This gives autonomy to each team to be creative about their own hiring strategy while also allowing the team to be much more specific about the assessment criteria, which will potentially lead to multiple hiring strategies to exist within an organization for the same role.

The other option is to hire software engineers into the organization based on generic criteria, and defer the team selection to a later point. This often works better based on my experience, as it makes it much easier for engineers to move between teams within an organization, which further helps the organization retain talent under circumstances where the employee wants to change their team for one reason or another.

Specialist vs. Generalist

It's common that software engineers sometimes end up specializing within an area, e.g. Backend, Frontend, Test, QA, iOS, Security, and so on. The list can go on, but the truth is that each of these roles requires a different skill set even if all of them require the person to write code and implement software to a certain extent.

On the other hand, some organizations purposely hire "Software Engineers". What this means can vary from organization to organization, but in general it refers to the generalist software engineer, who is well-versed in solving problems through designing and implementing software without necessarily constraining themselves within a workflow. In my experience, even these hires end up specializing in one area of software engineering. However, these engineers can still contribute pretty much throughout the whole lifecycle of the software delivery process.

In general, an organization's chosen interview process can be very different when it comes to hiring a specialist vs. generalist, but they can still have some fundamental common characteristics. I will touch on this later in this post.

Hiring Strategy Relationship with Whiteboard-style Coding Interviews

What does this information have to do with coding interviews, more specifically the whiteboard-style coding interviews? Having an understanding of what types of hiring strategies are out there, and which one is being used by the organization I am interviewing with, helps me gain a wider perspective about their interview process, and makes me empathize with the strategy. If I am convinced that the hiring strategy and the interview process align, this motivates me more. I am hoping this will at least be similar for you.

So, for the rest of the post, I will be making the assumption that the organization's hiring strategy has the following criteria:

Hiring engineers to solve tomorrow's problems, not just today's problems.
Hiring engineers into the organization, not particularly into a team. Even if the intention is to land them into a specific team, you are making the assumption that they can move within the organization.
Hiring generalist engineers, not specialists.

I made these choices here not because the alternatives are somehow bad. The main reason is that these are the criteria that many organizations generally choose as part of their software engineer hiring strategy, and these contribute heavily to why whiteboard-style coding interviews become the core part of the assessment throughout the interviews.

Whiteboard-style Coding Interviews

There seems to be a notable amount of speculation about the negative side of whiteboard-style coding interviews. I say speculation here on purpose, not because it's people's intention to speculate, or because all of the information out there is completely speculative. I say that because there are genuinely a lot of misconceptions and a lack of rationale out there when judging these interviews. As humans we sometimes forget that most things are not black and white, and trade-offs matter. I believe that's what is really happening here. We end up judging these processes from one side: the candidate's.

While acknowledging that every organization's intentions are different with this type of interview, I personally believe, with the experience I have both as an interviewer and interviewee, that I have a pretty solid idea of why these interviews are shaped the way they are today in general. Most importantly, it's actually largely positive for the candidates that these interviews are set up the way they are today.

These Interviews are Intentionally Structured

This type of interview assesses the candidate's core problem solving ability through coding while also assessing the candidate's fundamental computer science knowledge around algorithms, data structures, and complexity analysis. Throughout the interview, interviewers will look for signals to give them a higher confidence on these fronts, and this is the key to understand! The evaluation here is not binary, and it will likely be linked to how you communicate your ideas as well as how you execute on them, but let's ignore that for now to purely look at why these matter. If we go back to the tomorrow's problems point we touched on above, the reasoning will make much more sense. Even if the organization accepts the fact that tomorrow's problems are unpredictable and full of uncertainty, they need to be able to form some assumptions around some commonalities of these problems. One of those assumptions happens to be that tech-centric problems will eventually require an understanding of the core algorithms and data structures, as well as being able to solve problems effectively through coding.

They also assess the candidate's critical thinking ability within the context of a problem that needs to be solved through coding. This gives a pretty solid idea of what type of thought process you have, and how you reason about problems as a Software Engineer. This extends from thinking about edge cases to seeing opportunities within your solution to optimize it proactively in terms of various aspects (e.g. modularizing your solution, improving the runtime performance, choosing your test cases, etc.).

These coding challenges also set up a pretty good environment to be able to assess your analytical ability, to a certain extent. Can you reason about how your solution would perform with different input sizes? Can you ask the correct questions upfront to gather an accurate enough analytical reasoning, and proactively determine the optimum solution according to the info collected? All of this can give pretty accurate signals.

One of the things that these questions don't assess is the rote knowledge of the candidate on a particular technology such as a programming language. Unless you are being hired as a specialist on a specific programming language, you will likely be given the freedom to choose the language that you are most familiar with, even if that language is not among the languages that the organization is using. This is great, as it gives you the choice of the language which you are the most comfortable with. However, this doesn't mean that you don't need to be fluent in that programming language. It's quite the opposite. As you are given the choice of the language which you feel most comfortable with, you are expected to be fluent with the basics of the language, and it's very likely that you will only need the basics throughout the interview.

As a counterpoint to the above, one other thing that's not being evaluated is your rote knowledge of the syntax of the programming language you have chosen. As you will likely be performing in these interviews within an unfamiliar environment, such as an online code editor that won't have autocompletion or any other IDE features that you might be familiar with day to day, it's understandable that you may make some syntax mistakes here and there, and your code may not actually compile. This is totally OK, and you should not worry about this too much. That's one other reason why you will likely not be given any option to run your code. The core reason for this is to prevent noise around retrieving the accurate signals that the interviewer is looking for, and to get you to spend as much time as possible on the code logic, implementation, and testing. There is also a fundamental assumption here: if you are an engineer who is able to perform well within these interviews by solving the given problems effectively, you will almost always be able to find syntax problems and solve them. So, the potential time that you might have spent fixing these issues would have actually lowered your chances because you would not be giving any useful signals to the interviewer during that time.

Hiring Strategy and Coding Interview Structure

If we relate the structure of the coding interviews with the hiring strategy I mentioned in the previous section, we should now be able to see how things start to make more sense (🤞🏼):

The organization, which we are basing our assumptions on in this post, wants to hire generalist software engineers into the organization (not into a specific team) to solve tomorrow's problems, not just today's problems. So, the assessment is not restricted according to the work the organization is taking on today. The software engineering candidates are evaluated according to their core problem solving and computer science knowledge, based on several different coding challenges of varying degrees of difficulty, to maximize the chance of retrieving accurate, high quality signals and to reduce the chance of a false-positive or false-negative. Coding interviews are only one part of the entire interview process. However, on their own they give pretty accurate signals for the minimum bar around the candidate's:

skills to use a programming language to solve a particular problem through coding, while being able to form an algorithm
ability to be able to critically think about the problem, while being able to ask the right questions to widen their understanding on the problem, proactively finding out the edge cases, and potential optimizations
analytical reasoning for a given problem, according to requirements which may or may not be ambiguous to start with.

Whiteboard-style Coding Interviews are not Perfect

Whiteboard-style coding interviews are no silver bullet. Hiring is a complicated problem to solve in general, and this becomes much harder when it comes to hiring talented people to fit into an environment where creative thinking and core problem solving skills matter the most, if not as much as the core technical skills. It would be very naive to think that one solution would work perfectly to solve this complicated, multifaceted problem. It's again worth emphasizing that every organization is different. However, we should by now have a shared understanding of what type of organization I am referring to.

As applicable to nearly everything else in our precious World, this type of interview comes with its own trade-offs, and it's worth acknowledging them so that we can do as much as possible to mitigate them. The following list of negatives about these interviews is based on my own experience, and is not meant to be a definitive and exhaustive list.

They require a significant amount of preparation from candidates. This is a legitimate one, and I would like to believe that it's one of the biggest negatives that's felt commonly among all the candidates, including myself. As it's likely that you are not working directly with all the data structures day to day, and not spending all your time around algorithmic problems, you need to at least refresh your memory around these concepts, while also making sure that you can give accurate signals to the interviewers within 45-60 minutes. So, you gotta have the knowledge and skills, and prove that you have them fast. This requires preparation, and not everyone has the privilege to dedicate this amount of effort.
The structure can be copied by some organizations without actually forming a hiring strategy in the first place. This is the worst negative about these interviews, by far! As this interviewing style is adopted by top tech companies, it must be working effectively and efficiently, correct? Great, let's adopt the same process, and call it a day. Well, not so much. I hope that we have a shared understanding by now that hiring strategy plays a huge role in why these interviews exist in the way they do today. So, without understanding the why, it's going to be a miserable experience for your organization and the candidates if you end up copying the process. Don't do that. Understand your organization's unique problems first, and only then see if this type of coding interview is a good fit for the problems you are trying to solve as part of your hiring strategy.
They are occasionally executed poorly, and they lead to either false-negatives or false-positives. Sometimes the execution doesn't meet the intentions or the expectations, and this type of interview may occasionally be performed poorly. This could be a one-off occurrence due to the interviewer's own performance, or it might be a systematic issue (it could actually be related to the point above). For instance, the candidate might be rejected because they didn't get the syntax of the programming language right. It could also be that the candidate has performed really well on a coding interview because they already knew the questions, and executed flawlessly on paper without actually giving many high quality, accurate signals around their problem solving ability. Highly calibrated interviewers have the ability to retrieve the correct signals under these circumstances to understand when this is happening, and can often mitigate this type of situation really well. However, less-calibrated interviewers may sometimes overlook this, and potentially cause a false-positive outcome.

💡 A word on false-negatives: The topic of false-negatives is outside the purpose of this post, but comes up a lot when the coding interview topic is brought up. False-negative interview outcomes are the ones which conclude with the candidate's rejection when the same candidate would have actually performed well on the job (speculatively), even if the candidate's performance was less than strong during the interview itself. These are super common in software engineering interviews, and especially common in coding interviews. These outcomes might potentially cause an organization to miss out on incredible talent. However, when you think deeply about this, false-negatives are actually much more favorable compared to false-positives for the success and happiness of the organization and its employees in the long run.

The reason is simple: false-positives will lead to a potential decline of the talent within the organization, and this can grow exponentially. Once this starts happening, the talent you retained in your organization might also start being unhappy about this, because talented employees tend to want to work with talented colleagues, and this will eventually lead to attrition of the talent and a high turnover rate.

Let's assume that within the last month you had a hundred borderline outcomes through your interview process, and you let half of them slip through while taking the risk on these employees, as you didn't want to miss out on potentially great talent. Let's again assume half of these hires turn out to be false-positives, and they start interviewing candidates within 6 months or so. They will likely end up hiring engineers under the bar, and a good portion of those engineers will likely also turn out to be false-positives. Those hires will also end up interviewing engineers within the next 6 months or so, and this cycle will keep continuing, and it won't take long to see the impact of this exponential cycle.

You should also remember that once you hire the wrong person into your organization, letting go of that person is super painful and a long process, unless you are adopting a structure similar to what Netflix operates under (which I personally respect and relate to).

These interviews favour new graduate software engineers more than the seniors. The assumption seems to be that new graduate software engineer candidates can perform much better in this type of interview as they have fresh knowledge from college around all the core algorithms and data structures subjects, whereas senior engineers won't have this fresh knowledge. This might reflect the truth depending on what type of job a senior engineer is performing day-to-day. If you are not making use of these concepts during your daily job, you might need more preparation. However, going back to the core evaluation criteria, it is also likely the case that senior engineers have the edge when it comes to critical thinking ability and core problem solving. So, this one is a bit up in the air for me even if there is some truth to it.
These interviews are not at all in favour of software engineers who don't have a formal computer science education. This is largely correct, but also largely depends on what type of learning you have accumulated. If we were to speak generally though, it's going to be very rough until these candidates can feel comfortable around this type of coding challenge, let alone feel confident to perform well in the interviews. I am one of these folks without a formal CS education background. I still remember the time when I bombed a coding interview a long time ago, when the interviewer asked me to implement a stack data structure without using any collection types. I didn't really know how to do it, and also wasn't able to reason about the problem. However, I took this as a challenge, and over the years, I have accumulated tons of knowledge around algorithms, data structures, and complexity analysis thanks to this interview style. More importantly, this knowledge proved to be immensely useful many times while I was designing, implementing and reviewing systems and code changes. Long story short, this is not really a negative. It's either an excuse or an opportunity for you to learn and grow. In an era where entire lectures of formal CS courses are published online (e.g. see Introduction to Algorithms and Advanced Data Structures from MIT OpenCourseWare), this knowledge is much more easily acquirable than you might think.

Untold Truth?

⚠️ OK, this section is purely based on assumptions, and totally speculative. If you are the type of person who likes to focus on facts, or at least on opinions formed based on experience, you may wanna skip this. If not, buckle up!

We should be open about the fact that this style of coding interview requires us to prepare, and that takes a significant time commitment from the candidate's point of view. It's a privilege to have that time, but even assuming you have it, it still takes a significant amount of diligence and perseverance to be able to perform strongly in these interviews. The outcome is also not left to luck. It's common that the candidate goes through 4-6 coding challenges throughout the hiring process, and one less than strong performance could be enough for them to get rejected.

I personally believe that the preparation characteristics of these interviews on their own put off a great deal of candidates. These candidates could actually perform well on the job if they were to get it, but they never try because they don't have enough enthusiasm and perseverance to go through this process (even if they might have the time). I also speculatively believe that this is a good thing for the organization, especially the ones which can already attract a significant number of candidates, because they are implicitly evaluating the candidate's enthusiasm for the role, and general attitude on perseverance, through this characteristic of the interview style.

I don't know whether this is the correct assumption or not, but I wanted to put it out there. Considering that the software engineering hiring game favours false-negatives over false-positives (for the right reasons in my opinion), this assumption might not be too far off.

Common Misconceptions

Based on the common rationale and understanding we have laid down, I believe it's now fair to judge some of the common misconceptions about whiteboard-style coding interviews. Here are some of them that I am aware of which are worth specifically highlighting. This is not meant to be an exhaustive list:

You are being asked about questions which don't reflect the day to day experience of the engineers working in the organization, and there needs to be a system that should close the gap between the interview and the actual job that is being interviewed for.
- If you are hiring software engineers only for today's problems, by all means, you should absolutely do this. If, however, you want to hire software engineers to solve not just today's problems, but also tomorrow's, this advice introduces a lot of risk for the business.
Interviewing environment doesn't reflect the reality, e.g. you are left with a whiteboard or code editor which doesn't have autocompletion or even a compiler.
- This ties back to the assessment criteria and the signals which the interviewer is trying to retrieve from the candidate throughout the interview. Getting the syntax of the programming language right, and knowing how to use the compiler, are not among those criteria. The potential time that you might have spent fixing the potential issues related to these areas would have actually lowered your chances because you would not be giving any useful signals to the interviewer during that time. So, it is actually in your favour that you are forced to use a rudimentary coding environment for the test.
The candidates with interviewing anxiety are being passed on unfairly.
- Who doesn't have interviewing anxiety? Interviews are stressful as hell, but being able to control the stress and anxiety is also part of our lives. I also have to put this here: if actors with millions of dollars net worth are willing to take auditions, I am sure we, as Software Engineers, should be able to justify to ourselves that it might actually be reasonable to take the job interviews 🙂

Conclusion

TL;DR; is that whiteboard-style coding interviews exist for certain specific reasons, mainly to be able to assess the candidate's core knowledge around algorithms and data structures, and how well they will likely fit into the creative thinking process through their sharp problem solving skills with critical and analytical thinking to solve not just today's problems, but also tomorrow's unique problems within a potentially ambiguous environment. I believe this is largely a good thing when these interviews are set up and executed effectively by the organizations and interviewers. They often give a strong indication of what type of organization you are going to be jumping into, which usually tends to be the one which sees software engineers not just as execution machines, but also as part of their product development and innovation process, and values their abilities much more strongly.

They surely have some downsides, the biggest of which is that these interviews require significant upfront preparation, and not everyone has the privilege to commit a significant amount of time for this. The process itself also favours false-negatives over false-positives, which is a very good thing for the organization and their teams, but can be super demoralizing for the candidates.

They are not perfect for the problem that they are trying to solve, for sure. However, I am yet to see an alternative solution to this problem that works as effectively and efficiently as this one when hiring at scale according to the criteria we have just laid down (yes, take-home assessments are not an alternative!).

Further Resources

Coding Interviews are Broken
Why whiteboard interviews are BROKEN
Thread based on Jaana Dogan's recent Tweet on this topic: Very informative thread, it can be useful if you want to gain further perspective on how different people think about this.
Coding Interviews are NOT Broken
Hiring Without Whiteboards: it seems to be a curated list of companies (or teams) that don't do "whiteboard" interviews.
Get that job at Google: Steve Yegge's very informative post on hiring at Google. Very old now, but surprisingly still has a lot of relevant bits. It's worth checking out.
Moishe Lettvin - What I Learned Doing 250 Interviews at Google
Coding Interview Problem: Least Disruptive Subrange: Very helpful example coding challenge walk-through, full of rationale, from Jackson Gabbard, an ex-Facebook Software Engineer.

C++, Getting Started with the Basics: Working with Dependencies and Linker

Tugberk Ugurlu — Tue, 23 Mar 2021 18:46:00 +0000

Intro

As I mentioned in my previous post about C++, I am learning C++. It has been a bumpy ride so far, and C++ is certainly not an easy to pick up programming language! So, I thought what better way to make the learning stronger than blogging about my journey and pinning down my experience. You now know that the reason this post exists is a bit selfish, but I am hoping it will be helpful to some other folks who are going through the same while also acknowledging that everyone's mental model is different. So, YMMV.

In this post, I want to share my experience of incorporating a 3rd party dependency into my own program, and understanding what goes on under the hood during the compilation and linking phases of the build process. For the purposes of this, I will be aiming to incorporate the gflags C++ library into my own program. This library provides support for defining and parsing command line flags.

Taking a Library Dependency inside Our Own C++ Code

The way you state a library dependency in your own C++ code is through the #include directive, by specifying the header file that you want to take a dependency on. As we probably know by now, the header file doesn't actually contain the implementation, but only declares the contract between the library and the consumer. We will shortly touch on how we will be able to tie the header file to its implementation.

As we can see inside the gflags documentation, the header file we want to work with is called gflags/gflags.h. That immediately raised some questions for me. I am sure it will for you too if you happen to be a newbie in the C++ World like me. The biggest one of all is what the gflags/ folder is relative to. That will become clearer when it comes to the building part. So, for now, let's assume it's magic™️.

Now that we have learned how to take a dependency on this library within the code, here is what our sample program looks like:

#include <iostream>
#include <gflags/gflags.h>

DEFINE_string(name, "Tugberk", "Name of the person to greet");

int main(int argc, char *argv[]) {
    gflags::ParseCommandLineFlags(&argc, &argv, true);
    std::cout << "Hello " << FLAGS_name << std::endl;
}

Nothing fancy, and you can see the gflags documentation about the specifics of our usage here. The purpose of this post is not to explain that. The only reason that we are using gflags here is to demonstrate how to take a dependency on an external library, and it is an easy to use one that won't be hard to explain.

However, one thing that's worth noting is the usage of gflags:: before the ParseCommandLineFlags function call. The gflags being referred to here is the namespace, which we are betting will be declared within the gflags.h header file. gflags::ParseCommandLineFlags is the fully-qualified reference to the function we want to invoke.

Alternatively, we could have imported the entire gflags namespace, and be able to call ParseCommandLineFlags directly without a namespace declaration like the following, which would mean that you can use anything under that namespace directly:

#include <iostream>
#include <gflags/gflags.h>

using namespace gflags;

DEFINE_string(name, "Tugberk", "Name of the person to greet");

int main(int argc, char *argv[]) {
    ParseCommandLineFlags(&argc, &argv, true);
    std::cout << "Hello " << FLAGS_name << std::endl;
}

Based on my understanding, there is nothing wrong with this in terms of performance of the program or the compiler (I could be wrong, don't quote me on this). However, this will likely increase your changes of having a name collisions, and also it will make it a bit hard to read the code (i.e. it's not immediately clear where ParseCommandLineFlags is coming from).

One other alternative is to declare a using only for the name you want to use:

#include <iostream>
#include <gflags/gflags.h>

using gflags::ParseCommandLineFlags;

DEFINE_string(name, "Tugberk", "Name of the person to greet");

int main(int argc, char *argv[]) {
    ParseCommandLineFlags(&argc, &argv, true);
    std::cout << "Hello " << FLAGS_name << std::endl;
}

Although this still suffers from the same problems I listed above to a certain extent, this is a bit better, especially when you are planning to use the imported name a few times within the same file.

The final thing I want to note within this code is the use of DEFINE_string. It's also defined within the same header file. However, that's a macro and it doesn't seem to be tied to a namespace. I don't have much info about macros at this stage, but wanted to touch on why it's being used in this way.

Setting up the Build Pipeline

We have our implementation, which should give us a command line program where we can call hello-world --name Bob and have it print out Hello Bob for us. To be able to demonstrate different build variations, I am going to run the build within a Docker container. Configuration for this is going to be very simple. The code we have seen above will be inside the main.cpp file. To start with, we will also have a build.sh file with the following content:

#!/bin/bash

g++ -v ./main.cpp -o hello-world

-v is here to give verbose output from the compiler, which will be handy when it comes to understanding what goes on under the hood. The Dockerfile content will be as follows:

FROM ubuntu

RUN apt-get update && apt-get -y install build-essential

WORKDIR /opt/
RUN mkdir app
WORKDIR /opt/app

COPY ./ ./

RUN ./build.sh
CMD ["./hello-world", "--name=Bob"]

When I run docker build . with this setup, I'm getting an error:

...
...
Step 7/7 : RUN ./build.sh
 ---> Running in 39ce491a452e
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/9/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:hsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 9.3.0-17ubuntu1~20.04' --with-bugurl=file:///usr/share/doc/gcc-9/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,gm2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-9 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-9-HskZEa/gcc-9-9.3.0/debian/tmp-nvptx/usr,hsa --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04) 
COLLECT_GCC_OPTIONS='-v' '-o' 'hello-world' '-shared-libgcc' '-mtune=generic' '-march=x86-64'
 /usr/lib/gcc/x86_64-linux-gnu/9/cc1plus -quiet -v -imultiarch x86_64-linux-gnu -D_GNU_SOURCE ./main.cpp -quiet -dumpbase main.cpp -mtune=generic -march=x86-64 -auxbase main -version -fasynchronous-unwind-tables -fstack-protector-strong -Wformat -Wformat-security -fstack-clash-protection -fcf-protection -o /tmp/ccebxWeM.s
GNU C++14 (Ubuntu 9.3.0-17ubuntu1~20.04) version 9.3.0 (x86_64-linux-gnu)
	compiled by GNU C version 9.3.0, GMP version 6.2.0, MPFR version 4.0.2, MPC version 1.1.0, isl version isl-0.22.1-GMP

GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
ignoring duplicate directory "/usr/include/x86_64-linux-gnu/c++/9"
ignoring nonexistent directory "/usr/local/include/x86_64-linux-gnu"
ignoring nonexistent directory "/usr/lib/gcc/x86_64-linux-gnu/9/include-fixed"
ignoring nonexistent directory "/usr/lib/gcc/x86_64-linux-gnu/9/../../../../x86_64-linux-gnu/include"
#include "..." search starts here:
#include <...> search starts here:
 /usr/include/c++/9
 /usr/include/x86_64-linux-gnu/c++/9
 /usr/include/c++/9/backward
 /usr/lib/gcc/x86_64-linux-gnu/9/include
 /usr/local/include
 /usr/include/x86_64-linux-gnu
 /usr/include
End of search list.
GNU C++14 (Ubuntu 9.3.0-17ubuntu1~20.04) version 9.3.0 (x86_64-linux-gnu)
	compiled by GNU C version 9.3.0, GMP version 6.2.0, MPFR version 4.0.2, MPC version 1.1.0, isl version isl-0.22.1-GMP

GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
Compiler executable checksum: 466f818abe2f30ba03783f22bd12d815
./main.cpp:2:10: fatal error: gflags/gflags.h: No such file or directory
    2 | #include <gflags/gflags.h>
      |          ^~~~~~~~~~~~~~~~~
compilation terminated.
The command '/bin/sh -c ./build.sh' returned a non-zero code: 1

There are a few important things to call out here:

As you may remember from the previous C++ post, the compiler looks under several directories, which include /usr/local/include and a few others, right after hitting the #include directives during its preprocessing stage.
We can see that compilation is failing with the following error: gflags/gflags.h: No such file or directory. That's giving us an indication that the header file with the path of gflags/gflags.h wasn't found in any of the include directories which the compiler was searching under.

This is an expected error at this stage, because gflags is a 3rd party library, this is a fresh box and we didn't install that library.

Back to Basics: Compilation of a C++ Program

Let's pause a bit and learn some fundamentals. I kept mentioning compilation like it's a black box where you give it some input and get a compiled output back. Most of the time, this type of thinking will get us where we want to be. However, my aim here is to understand what's going on under the hood a bit more. When I went a bit deeper to understand the build process for C++, I found out that the build is broken down into three independent steps:

Preprocessing: This stage handles the preprocessor directives, like #include and #define. After the processing of these directives, the preprocessor produces a single output.
Compilation: The compilation step is performed on each output of the preprocessor, and this is the step where the C++ code is converted into assembly code. This step also involves the assembler, which turns the assembly code into machine code and produces an actual binary file (a.k.a. object file). The bit that's super interesting at this stage is that these object files can refer to symbols that are not defined, and this is how code that only sees the declarations coming from a header file can be compiled at this stage without the actual implementation.
Linking: This is the final step within our build process, and this step is handled by the linker, which produces the final output for our program from the object files that the compiler produced. This output can be either a library or an executable. It links all the object files by replacing the references to undefined symbols with the correct addresses, and if the definitions exist in libraries other than the standard one, the linker needs to be informed about these specifics, which is relevant to what I am trying to achieve in this post (more on this to come later).

You can check out this incredible Stackoverflow answer on this topic, which explains compilation steps of a C++ program more in-depth, and I copied most of what I mentioned in this section from there.

Preprocessing the Headers

Let's install gflags according to the installation guidelines of this library, and rerun the compilation:

diff --git a/1-dependency/Dockerfile b/1-dependency/Dockerfile
index fbaeba8..58215ea 100644
--- a/1-dependency/Dockerfile
+++ b/1-dependency/Dockerfile
@@ -1,6 +1,7 @@
 FROM ubuntu
 
 RUN apt-get update && apt-get -y install build-essential
+RUN apt-get -y install libgflags-dev
 
 WORKDIR /opt/
 RUN mkdir app

If I run the docker build . command again, it still gives me an error, but this time the error is different:

COLLECT_GCC_OPTIONS='-v' '-o' 'hello-world' '-shared-libgcc' '-mtune=generic' '-march=x86-64'
 /usr/lib/gcc/x86_64-linux-gnu/9/collect2 -plugin /usr/lib/gcc/x86_64-linux-gnu/9/liblto_plugin.so -plugin-opt=/usr/lib/gcc/x86_64-linux-gnu/9/lto-wrapper -plugin-opt=-fresolution=/tmp/ccjLVDaH.res -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lgcc --build-id --eh-frame-hdr -m elf_x86_64 --hash-style=gnu --as-needed -dynamic-linker /lib64/ld-linux-x86-64.so.2 -pie -z now -z relro -o hello-world /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/Scrt1.o /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/crti.o /usr/lib/gcc/x86_64-linux-gnu/9/crtbeginS.o -L/usr/lib/gcc/x86_64-linux-gnu/9 -L/usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/9/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/9/../../.. /tmp/ccg11F4K.o -lstdc++ -lm -lgcc_s -lgcc -lc -lgcc_s -lgcc /usr/lib/gcc/x86_64-linux-gnu/9/crtendS.o /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/crtn.o
/usr/bin/ld: /tmp/ccg11F4K.o: in function `main':
main.cpp:(.text+0x27): undefined reference to `google::ParseCommandLineFlags(int*, char***, bool)'
/usr/bin/ld: /tmp/ccg11F4K.o: in function `__static_initialization_and_destruction_0(int, int)':
main.cpp:(.text+0x12e): undefined reference to `google::FlagRegisterer::FlagRegisterer<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(char const*, char const*, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)'
collect2: error: ld returned 1 exit status
The command '/bin/sh -c ./build.sh' returned a non-zero code: 1

We are still not quite there yet. However, as a software engineer, you know that this is a great feeling! You made some progress, and the changes that you have just made had some impact to move you forward 🙂

What has happened here is that the compiler was able to find the header file to be able to preprocess the #include directives. However, where did it find it? We can try to look for gflags.h file inside the container and see where it's located:

# find / -iname gflags.h
/usr/include/gflags/gflags.h

This makes more sense now, as /usr/include is one of the directories the compiler searches for header files.

Linking

The error we have received this time seems to be coming from ld, the linker, and it seems to be indicating that there are undefined references to several objects and functions under google namespace.

/usr/bin/ld: /tmp/ccg11F4K.o: in function `main':
main.cpp:(.text+0x27): undefined reference to `google::ParseCommandLineFlags(int*, char***, bool)'

It's worth noting where this google:: namespace comes from. This library seems to be exposed under two namespaces: gflags and google. All the documentation refers to gflags. However, it seems like any usage under that namespace is eventually redirected to the google namespace. It took a while for me to understand why and how, but I documented the investigation in this Stackoverflow question. I would suggest checking that out first before basing any assumptions on the namespace usage.

This error is also expected, as we haven't yet told the compiler which library we want to link against, a.k.a. the archive or static library. For static library files, the filenames always start with lib and end with .a (archive, static library) on Unix/Linux (see this post for reference). We can use the -l command line option of the g++ compiler, which eventually passes it on to ld to add the archive file to the list of files to link. This option may be used any number of times. ld will search its path-list for occurrences of lib{archive}.a for every {archive} specified.

The error output above might be confusing since it seems like /usr/lib/gcc/x86_64-linux-gnu/9/collect2 is invoked directly, not ld. A quick search suggests to me that collect2 eventually calls ld, but I am not sure at this stage why and how the compiler located collect2 in the first place, and decided to call it instead of calling ld directly. For simplicity, I will ignore collect2 for the rest of the post, and only mention ld.

With this in mind, we should be able to complete our compilation journey by passing the -lgflags option to the g++ compiler:

#!/bin/bash

g++ -v ./main.cpp -lgflags -o hello-world

Now, let's run docker build . with this setup:

...
...
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
ignoring duplicate directory "/usr/include/x86_64-linux-gnu/c++/9"
ignoring nonexistent directory "/usr/local/include/x86_64-linux-gnu"
ignoring nonexistent directory "/usr/lib/gcc/x86_64-linux-gnu/9/include-fixed"
ignoring nonexistent directory "/usr/lib/gcc/x86_64-linux-gnu/9/../../../../x86_64-linux-gnu/include"
#include "..." search starts here:
#include <...> search starts here:
 /usr/include/c++/9
 /usr/include/x86_64-linux-gnu/c++/9
 /usr/include/c++/9/backward
 /usr/lib/gcc/x86_64-linux-gnu/9/include
 /usr/local/include
 /usr/include/x86_64-linux-gnu
 /usr/include
End of search list.
GNU C++14 (Ubuntu 9.3.0-17ubuntu1~20.04) version 9.3.0 (x86_64-linux-gnu)
	compiled by GNU C version 9.3.0, GMP version 6.2.0, MPFR version 4.0.2, MPC version 1.1.0, isl version isl-0.22.1-GMP

GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
Compiler executable checksum: 466f818abe2f30ba03783f22bd12d815
COLLECT_GCC_OPTIONS='-v' '-o' 'hello-world' '-shared-libgcc' '-mtune=generic' '-march=x86-64'
 as -v --64 -o /tmp/ccZZScyH.o /tmp/ccvjxfiH.s
GNU assembler version 2.34 (x86_64-linux-gnu) using BFD version (GNU Binutils for Ubuntu) 2.34
COMPILER_PATH=/usr/lib/gcc/x86_64-linux-gnu/9/:/usr/lib/gcc/x86_64-linux-gnu/9/:/usr/lib/gcc/x86_64-linux-gnu/:/usr/lib/gcc/x86_64-linux-gnu/9/:/usr/lib/gcc/x86_64-linux-gnu/
LIBRARY_PATH=/usr/lib/gcc/x86_64-linux-gnu/9/:/usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/:/usr/lib/gcc/x86_64-linux-gnu/9/../../../../lib/:/lib/x86_64-linux-gnu/:/lib/../lib/:/usr/lib/x86_64-linux-gnu/:/usr/lib/../lib/:/usr/lib/gcc/x86_64-linux-gnu/9/../../../:/lib/:/usr/lib/
COLLECT_GCC_OPTIONS='-v' '-o' 'hello-world' '-shared-libgcc' '-mtune=generic' '-march=x86-64'
 /usr/lib/gcc/x86_64-linux-gnu/9/collect2 -plugin /usr/lib/gcc/x86_64-linux-gnu/9/liblto_plugin.so -plugin-opt=/usr/lib/gcc/x86_64-linux-gnu/9/lto-wrapper -plugin-opt=-fresolution=/tmp/cc10fZHH.res -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lgcc --build-id --eh-frame-hdr -m elf_x86_64 --hash-style=gnu --as-needed -dynamic-linker /lib64/ld-linux-x86-64.so.2 -pie -z now -z relro -o hello-world /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/Scrt1.o /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/crti.o /usr/lib/gcc/x86_64-linux-gnu/9/crtbeginS.o -L/usr/lib/gcc/x86_64-linux-gnu/9 -L/usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/9/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/9/../../.. /tmp/ccZZScyH.o -lgflags -lstdc++ -lm -lgcc_s -lgcc -lc -lgcc_s -lgcc /usr/lib/gcc/x86_64-linux-gnu/9/crtendS.o /usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/crtn.o
COLLECT_GCC_OPTIONS='-v' '-o' 'hello-world' '-shared-libgcc' '-mtune=generic' '-march=x86-64'
Removing intermediate container ce5a3c257fe2
 ---> 455abaa9d2d9
Step 9/9 : CMD ["./hello-world", "--name=Bob"]
 ---> Running in 2b17e00b3210
Removing intermediate container 2b17e00b3210
 ---> cc8ae20c8aa8
Successfully built cc8ae20c8aa8

Build passed! If we look at the compiler output from this, we can see that the -lgflags option is passed to the linker: in the collect2 invocation above, it appears right after the /tmp/ccZZScyH.o object file.

Based on the information we have about the linker and with the -lgflags option being passed to it now, we know that the linker is looking for the libgflags.a static library file to use as part of the linking process. Where did it find it though, and how did it know to look there in the first place? Let's look for that file within the container:

➜ docker run -it cc8ae20c8aa8 /bin/sh
# find / -iname libgflags.a
/usr/lib/x86_64-linux-gnu/libgflags.a

That file lives under the /usr/lib/x86_64-linux-gnu folder. This is the folder where architecture-specific libraries live under Ubuntu. If we also look at what's being passed to the linker through the -L command line option, which adds a path to the list of paths that ld will search for archive libraries and ld control scripts, we will see that /usr/lib/x86_64-linux-gnu is already being passed.

Nice, the C++ build process is now making more sense for me 🙂

Just to make sure things are working as expected, I will run the container I have just built.

➜ docker run cc8ae20c8aa8                  
Hello Bob
➜ docker run cc8ae20c8aa8 ./hello-world --name=Alice
Hello Alice

It works as expected 🎉

Resources

These are the resources I benefited from while writing this post. It's only fair I give these some credit. They might not be entirely beneficial to you though:

C++, Getting Started with the Basics: Hello World and the Build Pipeline

Tugberk Ugurlu — Fri, 19 Mar 2021 01:09:00 +0000

Intro

I am probably the least qualified person to be writing a blog post about C++. So, please approach this post with some caution. But, why am I writing it then? Well, I am currently learning C++, and it has been an unusual experience compared to my other programming-language-learning journeys. At this stage, my assumption is that the main difference that causes me to struggle comes from the fact that:

C++ doesn't have a universally agreed way to define how projects should be structured and built (well, sort of)
C++ also doesn't have one defined way to manage your dependencies (well, sort of)
Finally, C++ is not a programming language that will spit out an executable which will perform garbage collection for you out of the box (well, it really doesn't have this)

All of these are quite unusual characteristics for me when learning a programming language. I admit, I have been spoiled! The other thing to admit is that I have used C++ before in projects, but on all those occasions, the projects and build structures were already set up, and I only needed to maintain the codebase and make occasional changes to it. Also, these were projects which didn't run at a scale significant enough to generate those usual high-scale issues. So, nothing really required me to deeply understand how C++ and its toolchain worked.

However, I knew that a few obstacles wouldn't wear me down, and I had to find a way to regain my perseverance! So, I thought what better way to make the learning stronger than blogging about my journey and pinning down my experience. And, here we are! You now know that the reason this post exists is a bit selfish, but I am hoping it will be helpful to some other folks who are going through the same while also acknowledging that everyone's mental model is different. So, YMMV!

If you are still here, let me tell you what this post is all about! I will be going through the "Hello World" experience for C++, while also taking the explanation a bit beyond and understanding how the build pipeline works by attempting to dive deep into the bowels of the compiler (but I also know I am probably only scratching the surface)!

Hello World

I am learning a programming language. So, I should be concerned about the syntax, right? Well, usually, but I really don't care about that too much at this stage. At the moment, I am making an assumption that I will be able to get a handle on the syntax gradually as I start solving actual problems. What I really wanted to focus on is the toolchain experience, and how I can glue things together.

So, I focused on getting a bare minimum program written and understanding what goes in that as much as possible. As you can guess, it's the good-old "Hello World" program. Here is the code for that which is saved within the main.cpp file.

#include <iostream>

int main() {
    std::cout << "Hello World" << std::endl;
}

Nothing fancy here. However, I was able to learn so many things just from this small program!

The first line is the #include directive, which allows us to define the dependency between our source file and the "include file" (a.k.a. header file). Header files contain the constant and macro definitions, and the declarations of external variables and complex data types. These files typically don't contain the actual implementation. I know that you have more questions now, but hopefully I will be able to touch on this more in the upcoming posts.
C++ has a Standard Library, which contains a collection of classes and functions, which are written in the core language and part of the C++ ISO Standard itself.
iostream is part of the standard library, and provides C++ input and output fundamentals. The std:: prefix refers to std, the namespace of the standard library.
cout is coming from the standard library, and represents the standard output stream.
You can write into an output stream using the "insertion operator" (i.e. <<), which sends bytes to that output stream object.

The Build Pipeline

The question now is how we can run this. C++ is a statically typed language which requires a compilation step (well, it's more involved than compilation but hopefully we will get there, hang in there!). So, we need to compile the source code into an executable which we can run. This is where things got a bit more interesting for me because there isn't one compiler that you can use for C++. There are at least three of them (possibly more?): g++, gcc, clang++ (well, I guess you can also count c++ which is actually a symbolic link). I honestly don't know enough about these at this stage to be able to tell you about the difference between them. So, for the purposes of not getting stuck, I am going to go with g++ for now, and there is no particular reason why I chose it other than the fact that most examples I have come across so far have been using g++ 🤷🏻‍♂️.

Here is something more fun! g++ on my Mac actually ends up calling clang. Go figure 🤷🏻‍♂️. If you are someone who understands the diff between C++ compilers, please direct me to a resource which would make me understand what is really going on here. I have given up on this for now 😕

OK, compiler choice is sorted out, kind of. Let's compile this now. Here is the simplest compilation we can run which will spit out an executable called hello-world:

g++ ./main.cpp -o hello-world

If we add the -v flag, we will actually be able to see in more detail what's going on underneath:

➜ g++ -v ./main.cpp -o hello-world
Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Target: x86_64-apple-darwin18.7.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
 "/Library/Developer/CommandLineTools/usr/bin/clang" -cc1 -triple x86_64-apple-macosx10.14.0 -Wdeprecated-objc-isa-usage -Werror=deprecated-objc-isa-usage -emit-obj -mrelax-all -disable-free -disable-llvm-verifier -discard-value-names -main-file-name main.cpp -mrelocation-model pic -pic-level 2 -mthread-model posix -mdisable-fp-elim -fno-strict-return -masm-verbose -munwind-tables -target-sdk-version=10.14 -target-cpu penryn -dwarf-column-info -debugger-tuning=lldb -target-linker-version 450.3 -v -resource-dir /Library/Developer/CommandLineTools/usr/lib/clang/10.0.1 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk -I/usr/local/include -stdlib=libc++ -Wno-atomic-implicit-seq-cst -Wno-framework-include-private-from-public -Wno-atimport-in-framework-header -Wno-quoted-include-in-framework-header -fdeprecated-macro -fdebug-compilation-dir /Users/tugberkugurlu/go/src/github.com/tugberkugurlu/cmake-getting-started/0-hello-world -ferror-limit 19 -fmessage-length 95 -stack-protector 1 -fblocks -fencode-extended-block-signature -fregister-global-dtors-with-atexit -fobjc-runtime=macosx-10.14.0 -fcxx-exceptions -fexceptions -fmax-type-align=16 -fdiagnostics-show-option -fcolor-diagnostics -o /var/folders/l4/2c_f_d8973z3g7lmkb9xcw9h0000gn/T/main-d3ea6f.o -x c++ ./main.cpp
clang -cc1 version 10.0.1 (clang-1001.0.46.4) default target x86_64-apple-darwin18.7.0
ignoring nonexistent directory "/Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/usr/include/c++/v1"
ignoring nonexistent directory "/Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/usr/local/include"
ignoring nonexistent directory "/Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/Library/Frameworks"
#include "..." search starts here:
#include <...> search starts here:
 /usr/local/include
 /Library/Developer/CommandLineTools/usr/include/c++/v1
 /Library/Developer/CommandLineTools/usr/lib/clang/10.0.1/include
 /Library/Developer/CommandLineTools/usr/include
 /Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/usr/include
 /Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/System/Library/Frameworks (framework directory)
End of search list.
 "/Library/Developer/CommandLineTools/usr/bin/ld" -demangle -lto_library /Library/Developer/CommandLineTools/usr/lib/libLTO.dylib -no_deduplicate -dynamic -arch x86_64 -macosx_version_min 10.14.0 -syslibroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk -o hello-world /var/folders/l4/2c_f_d8973z3g7lmkb9xcw9h0000gn/T/main-d3ea6f.o -L. -L/Users/tugberkugurlu/.tensorflow-1.11.0/lib -L/usr/local/lib -lc++ -lSystem /Library/Developer/CommandLineTools/usr/lib/clang/10.0.1/lib/darwin/libclang_rt.osx.a

Lots of useful things to unpack from this output, which really helped me understand how the compiler is behaving! I cannot say that I currently understand all of it, but let me try to explain what I have been able to extract from this so far:

clang compiler is called to compile the main.cpp file, and spit out the file called main-d3ea6f.o, which contains the compiled object code. That file is being put under the temporary /var/folders/l4/2c_f_d8973z3g7lmkb9xcw9h0000gn/T folder, so that the compiler can refer back to it later.
The compilation will happen for the target x86_64-apple-darwin18.7.0. I am assuming this is used as the default because I am performing this compilation on my Mac, and I haven't specified a target for the compiler.
While compilation is happening, the compiler looks under several directories, which include /usr/local/include and a few others, right after hitting the #include directive. These directories are known as include directories, and these are where the header files are being looked for. In my case, the iostream header file is located under /Library/Developer/CommandLineTools/usr/include/c++/v1.
Once the compilation is performed, ld is invoked. ld is the linker (see man ld), which combines several object files and libraries, resolves references and produces an output file. If you look at the output, you can see that the main-d3ea6f.o file, which contains our compiled object code, is passed into ld as one of its arguments. There are also a few folders passed in as arguments here, and one of them is /Users/tugberkugurlu/.tensorflow-1.11.0/lib, which is a bit strange. The reason it's there is that it's set as one of the paths in the LIBRARY_PATH environment variable for me. The colon-separated list of directories in this environment variable is used by the linker when searching for special linker files.
You can also see that the output of ld is specified as -o hello-world, which is the name that we have given to the g++ compiler.

I am most likely glossing over a lot of details here. I have found this article to be very informative when it comes to explaining the C++ build pipeline, by breaking it down into three steps called preprocessing, compilation and linking. So, please check it out for a more thorough explanation.

After the compilation and linking, we end up with an executable file called hello-world, and we can execute it to see our super complex output:

➜ ./hello-world
Hello World

What's Next?

This was a very basic example. I am sure you can relate to the fact that almost none of the real world problems will be solved with an implementation this simple. The next part for me will be to look at how we can work with a multi-file project as well as being able to take an external open source library as a dependency.

I am new on my C++ journey. So, if you see anything wrong and things that I can benefit from, please do leave a comment on this post 🙂

Resources

Configure Free Wildcard SSL Certificate on AWS Application Load Balancer (ALB) Through Terraform

Tugberk Ugurlu — Fri, 25 Dec 2020 22:10:00 +0000

Last week, I moved all my personal compute and storage from Azure to AWS. I took this opportunity as an excuse to also start managing all that infrastructure through Terraform. Why AWS though? I had the chance to use AWS before on and off, but since I joined Deliveroo 2 years ago, I have been using AWS exclusively and extensively. So, it's the least friction for me when it comes to working with a cloud provider. That said, this migration has still been a really great learning experience, and it also emphasized for me that AWS is a million miles ahead in their journey when it comes to developer experience. Things just work ™️, especially when it comes to gluing things together (we will see an example of that in this post). When they don't, the reasons are also very obvious, which makes it easy to diagnose what's going wrong (although, it's probably because of IAM for like 99.9% of the cases).

During this migration, I have also discovered that you can actually configure SSL on your own domain for free, without any additional charges, through AWS Certificate Manager (ACM) if you are already using AWS Application Load Balancer (ALB). This was a valuable find for me, as I needed to enable HTTPS for this blog, which I have been procrastinating on, like forever. However, when I think about it, the additional payment for the SSL certificate wasn't the only reason that was making me delay getting one. It was more to do with the cost of maintenance that I didn't really want to get into (e.g. certificate renewals and all that).

ALB and ACM integration addresses both of these issues, by providing a way to configure SSL as well as keeping it automatically renewed without any additional charges. To be fair, there is probably also a way to automate this all on Azure, but I have also been away from that world for over 2 years now, and I didn't have the mental capacity to sort it out. Anyway, enough with the excuses, and let's see how to get this all sorted through Terraform.

Request Certificate Creation With AWS Certificate Manager

For the purpose of serving the content of this blog through HTTPS, I wanted to create an SSL certificate for www.tugberkugurlu.com. However, I also wanted to have the option to serve other content under subdomains. That led me to look into whether I can actually create a wildcard certificate, and this turned out to be possible. As stated in the ACM characteristics docs, ACM allows you to use an asterisk (*) in the domain name to create an ACM certificate containing a wildcard name that can protect several subdomains.

With that information, the next step was to see how Terraform would allow me to create a wildcard certificate. Terraform AWS provider already has a resource to create the certificate, which is called aws_acm_certificate:

resource "aws_acm_certificate" "tugberkugurlu_com" {
  domain_name               = "tugberkugurlu.com"
  subject_alternative_names = ["*.tugberkugurlu.com"]
  validation_method         = "DNS"
}

Let's take a look at what each of these means:

domain_name: Fully qualified domain name (FQDN), that you want to secure with an ACM certificate.
subject_alternative_names: Additional FQDNs to be included in the Subject Alternative Name extension of the ACM certificate. Here, we can use an asterisk (*) to create a wildcard certificate that protects several sites in the same domain. However, note that the asterisk (*) can protect only one subdomain level when you request a wildcard certificate. For example, *.tugberkugurlu.com can protect foo.tugberkugurlu.com and bar.tugberkugurlu.com, but it cannot protect foo.bar.tugberkugurlu.com. Another thing to note here is that *.tugberkugurlu.com protects only the subdomains of tugberkugurlu.com, it does not protect the domain apex (i.e. tugberkugurlu.com in our case here). That's why I am providing that through the domain_name.
validation_method: ACM needs to validate that you actually own this domain before it can issue a public certificate. This validation can be performed through either EMAIL or DNS. I am going with DNS here for several reasons:
- DNS validation is required to be able to renew your certificates automatically through the managed certificate renewal process.
- DNS validation also allows us to complete the whole process through Terraform if you use Route53 as your domain's DNS host and manage its state through Terraform as well.

This is all we have to do to request the creation of the certificate, and like I mentioned, the certificate renewal is going to be performed automatically for us since we are creating the initial certificate with DNS validation. See this documentation around ACM certificate renewal to find out more about how the automatic renewal works. It's also worth mentioning that there is further configuration you can provide. So, I suggest you check out the Terraform documentation for the ACM certificate resource to learn more about those options in case you end up needing them.
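To make the one-level wildcard rule described for subject_alternative_names a bit more concrete, here is a minimal Python sketch of the matching logic. It is a simplification for illustration only (the function names are made up, and it is not how ACM or any TLS client actually implements name matching); it only shows why both the apex name and the wildcard name are needed on the certificate.

```python
def wildcard_covers(pattern: str, host: str) -> bool:
    """Return True if a certificate name such as '*.example.com' covers host.

    Simplified for illustration: '*' stands in for exactly one DNS label.
    """
    pattern_labels = pattern.lower().split(".")
    host_labels = host.lower().split(".")
    # A wildcard replaces a single label, so the label counts must be equal.
    if len(pattern_labels) != len(host_labels):
        return False
    for pattern_label, host_label in zip(pattern_labels, host_labels):
        if pattern_label == "*":
            if not host_label:
                return False
            continue
        if pattern_label != host_label:
            return False
    return True


def certificate_covers(names, host: str) -> bool:
    """Return True if any name on the certificate covers host."""
    return any(wildcard_covers(name, host) for name in names)


# The two names requested in the aws_acm_certificate resource above.
CERTIFICATE_NAMES = ["tugberkugurlu.com", "*.tugberkugurlu.com"]

print(certificate_covers(CERTIFICATE_NAMES, "www.tugberkugurlu.com"))      # True
print(certificate_covers(CERTIFICATE_NAMES, "tugberkugurlu.com"))          # True
print(certificate_covers(CERTIFICATE_NAMES, "foo.bar.tugberkugurlu.com"))  # False
```

Dropping the apex name from the list makes the second check fail, which is the reason the apex is passed through domain_name.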

Domain Name Validation Through Route53 DNS Configuration

Before I applied the changes, I wanted to make sure that the domain validation side of the story was also sorted out. I use Route53 as my DNS service, and this made it so much easier to perform the validation.

To be fair, even without this, it shouldn't be too much of a hassle as DNS validation is just going to be a one-off process regardless of the approach. So, it's still low friction even if you need to perform this manually.

The main data point I needed to hook into for this was the domain_validation_options attribute, which is exported through the aws_acm_certificate resource for my certificate. Quoting from the documentation directly, this attribute gives you the domain validation objects which can be used to complete certificate validation. Note that this can have more than one value. So, we need to keep that in mind when we are using this.

This is great as we can use this value to create a Route53 record through the aws_route53_record resource. Each domain validation object exports a few attributes for us:

domain_name: The domain name to be validated.
resource_record_name: The name of the DNS record to create to validate the certificate.
resource_record_type: The type of DNS record to create, e.g. A, CNAME, etc.
resource_record_value: The value the DNS record needs to have.

The only issue was to figure out how to iterate over the domain_validation_options collection, and create an aws_route53_record resource for each item. Luckily, Terraform has a way to make this work through the for_each meta-argument, which allows us to create an instance for each item in a map or set.

Here is what my aws_route53_record resource declaration looked like:

resource "aws_route53_record" "tugberkugurlu_com_acm_validation" {
  for_each = {
    for dvo in aws_acm_certificate.tugberkugurlu_com.domain_validation_options : dvo.domain_name => {
      name   = dvo.resource_record_name
      record = dvo.resource_record_value
      type   = dvo.resource_record_type
    }
  }

  zone_id = aws_route53_zone.tugberkugurlu_com.zone_id
  name    = each.value.name
  type    = each.value.type
  ttl     = 60
  records = [
    each.value.record,
  ]

  allow_overwrite = true
}

zone_id here refers to the zone_id attribute from an aws_route53_zone resource which I already had declared for this domain.

Another interesting bit here is the allow_overwrite argument which is used to allow creation of this record in Terraform to overwrite an existing record, if any. It turns out that domain_validation_options can result in duplicate DNS records, and this argument seems to have been added just for this purpose.
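To illustrate why duplicates can show up, here is a small Python sketch that mimics the for expression above with plain dictionaries. The record name and value are made-up placeholders, and the assumption that the apex and the wildcard name share the same validation record is only for illustration; the point is that the map ends up with one entry per domain name even when the underlying DNS record is the same.

```python
# Hypothetical data shaped like domain_validation_options. The record name
# and value below are placeholders, not real ACM output.
domain_validation_options = [
    {
        "domain_name": "tugberkugurlu.com",
        "resource_record_name": "_example-token.tugberkugurlu.com.",
        "resource_record_type": "CNAME",
        "resource_record_value": "_example-value.acm-validations.aws.",
    },
    {
        "domain_name": "*.tugberkugurlu.com",
        "resource_record_name": "_example-token.tugberkugurlu.com.",
        "resource_record_type": "CNAME",
        "resource_record_value": "_example-value.acm-validations.aws.",
    },
]

# Rough equivalent of the Terraform for expression: one entry per domain name.
records_by_domain = {
    dvo["domain_name"]: {
        "name": dvo["resource_record_name"],
        "record": dvo["resource_record_value"],
        "type": dvo["resource_record_type"],
    }
    for dvo in domain_validation_options
}

# The distinct DNS records that would actually exist in the hosted zone.
distinct_records = {
    (record["name"], record["type"], record["record"])
    for record in records_by_domain.values()
}

print(len(records_by_domain))  # 2 resource instances requested
print(len(distinct_records))   # 1 distinct DNS record
```

Under that assumption, two resource instances would try to manage the same DNS record, which is the situation allow_overwrite is meant to tolerate.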

The Route53 records should on their own be enough to get the validation performed. However, one caveat here is that the validation will happen asynchronously. Therefore, your certificate might not be usable right away (a.k.a. the good old eventual consistency). Terraform already has a solution for this, too, through the aws_acm_certificate_validation resource. This resource implements a part of the validation workflow and represents a successful validation of an ACM certificate by waiting for validation to complete. Note that this doesn't represent a real-world entity in AWS. So, changing or deleting this resource on its own has no immediate effect.

As we already have the aws_acm_certificate and aws_route53_record(s) for the validation, we can easily declare an aws_acm_certificate_validation resource:

resource "aws_acm_certificate_validation" "tugberkugurlu_com" {
  certificate_arn         = aws_acm_certificate.tugberkugurlu_com.arn
  validation_record_fqdns = [for record in aws_route53_record.tugberkugurlu_com_acm_validation : record.fqdn]
}

As there can be multiple aws_route53_record.tugberkugurlu_com_acm_validation resources, we make use of the Terraform for expression to assign the validation_record_fqdns argument.

This is all that's needed for the certificate creation and its validation. Once I executed the terraform apply command, the certificate was created and it was all ready to use.
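Conceptually, "waiting for validation to complete" boils down to polling the certificate status until it is issued. The following Python sketch shows that idea under some assumptions: wait_until_issued and its parameters are hypothetical names, the status source is a plain callable rather than a real AWS call, and this is not how the Terraform provider is implemented. The ISSUED and PENDING_VALIDATION values are the ACM certificate statuses.

```python
import time


def wait_until_issued(get_status, timeout_seconds=300.0, poll_seconds=5.0,
                      sleep=time.sleep):
    """Poll get_status() until it reports ISSUED.

    get_status is any callable returning the certificate status as a string.
    Raises RuntimeError for a terminal status other than ISSUED, and
    TimeoutError if the certificate is still pending at the deadline.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status == "ISSUED":
            return
        if status != "PENDING_VALIDATION":
            raise RuntimeError(f"validation ended with status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError("certificate was not issued in time")
        sleep(poll_seconds)


# Usage with a fake status source, so the sketch runs without AWS access.
fake_statuses = iter(["PENDING_VALIDATION", "PENDING_VALIDATION", "ISSUED"])
wait_until_issued(lambda: next(fake_statuses), sleep=lambda seconds: None)
print("certificate issued")
```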

Wiring It up with Application Load Balancer

I am going to skip what AWS ALB is, how it works, and how to configure it to start directing traffic to your resources (e.g. ECS services, Lambda, EC2 instances, etc.). However, it's worth checking out the ALB documentation before this post if you don't have a good grasp of its concepts.

I had already created an Application Load Balancer under my account through Terraform, with a Target Group wired to my ECS Service. When you create an ALB, AWS assigns a domain name for you so that you can access the ALB publicly. However, it's highly likely that you want to hide this away by allowing access to your site through your own domain. When that's the case, HTTPS will only work properly if the ALB can present a certificate issued for that domain; without one, it cannot prove that it is the domain it's being accessed through.

Therefore, we need a way to wire up our own certificate issued to our own domain with the ALB resource. AWS makes this super easy when the certificate is issued through ACM. What you need to do is to attach a listener to your load balancer through the aws_lb_listener Terraform resource to listen on port 443. Then, you can also attach the SSL certificate we have on ACM. Here is what my configuration looked like for this:

resource "aws_lb_listener" "tugberkugurlu_com_https_forward" {
  load_balancer_arn = aws_lb.tugberkugurlu_com.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-2016-08"
  certificate_arn   = aws_acm_certificate.tugberkugurlu_com.arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.tugberkugurlu_com.arn
  }
}

It's pretty self-explanatory, but there are a few things that are worth touching on:

load_balancer_arn: This points to the ALB resource ARN (Amazon Resource Name) which I had previously created through aws_lb resource.
ssl_policy: You can see the Security Policies section of the ALB documentation, which gives more information about this. To be honest, I didn't fully understand the full extent of this configuration, and just used what's recommended for compatibility. For what it's worth, using the recommended value didn't seem to have a notable implication for what I was trying to achieve.
certificate_arn: This points to the ACM certificate ARN which we have created previously with one of the steps above.
default_action: Default action for the listener. In my case here, I want it to direct traffic to resources I configured with the target group.

I'm intentionally skipping the explanation around target groups in this post (e.g. how they are defined, how they work, etc.) since it would easily fill a blog post of its own. Besides that, it's worth noting that the action type here doesn't have to be forward; it can be any of the allowed rule action types.

One other thing that I want to mention around target groups is that a target's protocol doesn't have to be the same as the listener protocol. Therefore, you can use the ALB for SSL termination here by configuring your HTTP endpoints within the target group with the HTTP protocol.

⚠️ Don't forget to allow ingress traffic for TCP port 443 through your security group for the ALB once you add the HTTPS listener. Otherwise, the requests won't hit your ALB listener:
resource "aws_security_group" "tugberkugurlu_com_lb" {
  name        = "lb-sg"
  description = "controls access to the Application Load Balancer (ALB)"

  # ...

  ingress {
    protocol    = "tcp"
    from_port   = 443
    to_port     = 443
    cidr_blocks = ["0.0.0.0/0"]
  }

  # ...
}

Once I applied this, I had the HTTPS working for tugberkugurlu.com 🎉

➜ curl -v https://www.tugberkugurlu.com
* Rebuilt URL to: https://www.tugberkugurlu.com/
*   Trying 3.139.131.63...
* TCP_NODELAY set
* Connected to www.tugberkugurlu.com (3.139.131.63) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=tugberkugurlu.com
*  start date: Dec 16 00:00:00 2020 GMT
*  expire date: Jan 14 23:59:59 2022 GMT
*  subjectAltName: host "www.tugberkugurlu.com" matched cert's "*.tugberkugurlu.com"
*  issuer: C=US; O=Amazon; OU=Server CA 1B; CN=Amazon
*  SSL certificate verify ok.
...
...

Redirecting HTTP Traffic Through an ALB Rule

As I didn't previously have HTTPS on this web site, all the existing links out there (e.g. all of the pages were indexed on search engines) were pointing to HTTP on port 80. So, I wanted to still be able to listen on port 80, and redirect traffic back to port 443 to serve it over HTTPS.

ALB also makes this super easy, as you can wire up more than one listener onto your load balancer (maximum 50 listeners are allowed per load balancer). So, we can do that for port 80 on HTTP protocol, and use the redirect action to direct traffic to port 443 on HTTPS protocol.

Here is what this looks like in Terraform:

resource "aws_lb_listener" "tugberkugurlu_com_https_redirect" {
  load_balancer_arn = aws_lb.tugberkugurlu_com.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type = "redirect"

    redirect {
      port        = "443"
      protocol    = "HTTPS"
      status_code = "HTTP_301"
    }
  }
}

After I applied this change through terraform apply, it all worked as expected:

➜ curl -v http://www.tugberkugurlu.com 
* Rebuilt URL to: http://www.tugberkugurlu.com/
*   Trying 3.14.215.140...
* TCP_NODELAY set
* Connected to www.tugberkugurlu.com (3.14.215.140) port 80 (#0)
> GET / HTTP/1.1
> Host: www.tugberkugurlu.com
> User-Agent: curl/7.54.0
> Accept: */*
> 
< HTTP/1.1 301 Moved Permanently
< Server: awselb/2.0
< Date: Fri, 25 Dec 2020 22:17:03 GMT
< Content-Type: text/html
< Content-Length: 134
< Connection: keep-alive
< Location: https://www.tugberkugurlu.com:443/
< 
<html>
<head><title>301 Moved Permanently</title></head>
<body>
<center><h1>301 Moved Permanently</h1></center>
</body>
</html>
* Connection #0 to host www.tugberkugurlu.com left intact
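
The Location header in the response above keeps the host and path of the original request and only swaps the protocol and port. As a rough illustration of that rewrite (not the ALB's actual implementation; https_redirect_location is a hypothetical helper), here is a minimal Python sketch:

```python
from urllib.parse import urlsplit, urlunsplit


def https_redirect_location(url: str, port: int = 443) -> str:
    """Build the redirect target for a plain HTTP request URL.

    Keeps host, path and query as they are; only the scheme and the port
    change, mirroring the redirect block in the listener above.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    path = parts.path or "/"
    return urlunsplit(("https", f"{host}:{port}", path, parts.query, ""))


print(https_redirect_location("http://www.tugberkugurlu.com"))
# https://www.tugberkugurlu.com:443/
```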

Redis Cluster - Benefits of Sharding and How It Works

Tugberk Ugurlu — Sun, 20 Dec 2020 01:34:00 +0000

Content

The Problem

Redis Cluster: Enter

Key Distribution

Hash Tags: Getting back into control of your sharding strategy

Redirection

Distributing Reads

Conclusion

Redis is by far one of the most frequently used data stores. It's fascinating how often our software-developer minds go to Redis when we are faced with a data storage problem that requires some level of scale. Even if this might make us feel guilty, I have a somewhat confident assumption that this is the case, and there is probably a relation here to its simplicity: e.g. Redis is 'just' a data structure server, a hash table in the 'cloud', etc. (I know I am exaggerating a bit here, but hopefully you get the idea). Redis also makes digestible and reasonable trade-offs, and it allows us to solve many problems which require a certain degree of scale.

For a long time, Redis has come with out-of-the-box replication functionality, which allows for a high availability (HA) setup as well as allowing us to scale the reads by distributing the load across replicas at the cost of eventual consistency. However, it was only in April 2015 that Redis added support for built-in sharding functionality with its version 3 release. I have been working with several Redis Cluster setups for a while, and have probably read the Redis Cluster spec at least a couple of times. In this post, my aim is to give you a better understanding of what problem Redis Cluster actually solves, why you need such a setup, and the most important details you need to know about its configuration and implementation, based on my own experience.

The Problem

When designing a software system, we have somewhat of an idea what the scale of usage is going to be on that system. This could be based on previous usage patterns of the same or similar functionality, on the data you collected from an experiment that has been run with rudimentary functionality at a smaller scale, or on just a pure guess. If you are mature enough as a business, you should also be able to project how much the expected growth is going to be for the foreseeable future (e.g. the next 12 or 24 months). All of this data should help in determining a baseline number, from which you can then extrapolate to understand the load estimations for the system that you are designing.

Being a software engineer, I bet you also have the urge to boil these estimates down to the peak number of writes/reads per second so that you can reason about these numbers in a relatable way, and can test your system accordingly before going to production. Ideally, you also want to be on the comfortable side, and will likely want to overscale by around 20% in case your estimation turns out to be wrong.

So far so good, and this is exactly what I would expect from a software engineer who knows what they are doing and has proper critical thinking skills. The reason is that these numbers will help you choose the shape and size of the resources you want to set up (e.g. the node size of your ElastiCache Redis instance, etc.), which will help you optimize your resources. That said, we still have problems with this:

These estimations are just estimations, and they will almost certainly turn out to be wrong. When the real load is higher than you expected, you will struggle with it. When it's lower, you will likely burn money unnecessarily and will be overscaled more than you would like to be.
There will always (I actually mean 'always' here) be unforeseen business activities or external events which will impact the load on your system (e.g. marketing campaigns, etc.). These activities may actually have dramatic impact on the per-second based load. In those circumstances, you need to find a way to accommodate the needs of the new load without actually having any downtime.

Why am I talking about these? These problems are actually what make Redis Cluster a suitable candidate for your needs, especially when those problems are centered around the writes. For reads, you might still be able to get away with a single master setup by wiring up as many replicas as you need. This should allow you to distribute the read load across replicas at the cost of a data consistency gap depending on the replication lag, which would take the pressure off the master. When the load is lower and you don't need all the replicas, you can tear some down to save some £££. All of these operations shouldn't really require too much logic on the clients, and you should really be able to get away with only employing logic to detect the addition of a new Redis replica, and to start directing requests to it.

However, the matter is not that simple for writes. One option we have here is scaling up the nodes (i.e. adding more resources). However, that is going to be a complex operation to perform without introducing downtime. There is also a limit to how much you can scale up (although for the majority of use cases out there, you may never need to go close to that limit). This could still be an option when the issue is with memory. However, not so much for CPU. When it comes to Redis, your CPU is rarely the issue. It's throughput that ends up becoming the bottleneck.

If we want to approach this problem the same way we have approached the read scaling issue, there are some questions that really deserve an upfront answer:

How are the clients going to know which node to write data into, and read data from?
What will happen when we add a new node to scale the writes?
What will happen when we remove a node to scale down?
How can we distribute the load evenly across the nodes?
If we are making multi-command operations (e.g. pipeline requests, MGET, etc.), how are those going to work with this model?

Don't get me wrong here: these problems are not unique to Redis. Any data storage system that needs to scale writes faces the same challenges, and there are some common techniques to address them, such as data sharding. We are now about to see how Redis tackles these problems through the same technique, with some spice added on top to cater for its unique needs.

Redis Cluster: Enter

Since v3.0, Redis has included out-of-the-box support for a data sharding solution, which is called Redis Cluster. It provides a way to run a Redis installation where data is sharded across multiple Redis nodes as well as providing tools to manage the setup. These Redis nodes still have the same capabilities as a normal Redis node, and they can have their own replica sets. The only difference is that each node will be holding only a subset of your data, which will depend on the shape of the data and Redis' key distribution model (don't worry about this now, we will get to this concept shortly).

I have configured a local Redis cluster setup to use throughout this blog post, and with the help of the CLUSTER NODES command, I can see its high level structure:

172.19.197.2:6379> CLUSTER NODES
b7366bdbb09dbb20dcf0d4f8b7281c98f7e3b78e 172.19.197.7:6379@16379 master - 0 1608418117542 10 connected 10923-16383
164dc6aaf77aa0530490f0c9fbf5c8eb9f653a53 172.19.197.5:6379@16379 slave fdf56116c8b8f322561c7189574e6092101fa718 0 1608418118557 12 connected
f75939944d18ee12995c60d4cc9fcc1e53458d32 172.19.197.3:6379@16379 slave 88875e065f5ecf24b5adde973223a7799aee4521 0 1608418117949 11 connected
fdf56116c8b8f322561c7189574e6092101fa718 172.19.197.2:6379@16379 myself,master - 0 1608418118000 12 connected 0-5460
1c822510aa0f349a9b12cba1c68bc98feab5433e 172.19.197.4:6379@16379 slave b7366bdbb09dbb20dcf0d4f8b7281c98f7e3b78e 0 1608418118000 10 connected
88875e065f5ecf24b5adde973223a7799aee4521 172.19.197.6:6379@16379 master - 0 1608418118963 11 connected 5461-10922

You can learn more about the serialization format of this output from the doc, but let me take a stab at summarizing it:

We have a setup of 3 master nodes, each having one replica.
We are currently connected to the node at 172.19.197.2:6379, and its node ID is fdf56116c8b8f322561c7189574e6092101fa718. We know this is the node we are connected to as the myself flag indicates the node you are contacting. This node is also one of the master nodes.
The node that we are connected to is shown to be responsible for the 0-5460 slot range (don't worry about what exactly this is now, we will get to this shortly).
The node at 172.19.197.5:6379 is the replica of the node which we are connected to. We know this as the node ID fdf56116c8b8f322561c7189574e6092101fa718 is shown under the master column, and we know that this is the ID of the node that we are connected to.
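
If you want to pull the same facts out of the CLUSTER NODES output programmatically, the following Python sketch parses the space-separated fields (node ID, address, flags, master ID, and the slot ranges from the ninth field onwards). It is a minimal illustration that assumes the layout shown above; ClusterNode and parse_cluster_nodes are names I made up for this example, and entries for slots that are being migrated (wrapped in square brackets) are simply skipped.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ClusterNode:
    node_id: str
    address: str                   # ip:port, without the cluster bus port
    flags: List[str]
    master_id: Optional[str]       # None for master nodes
    slots: List[Tuple[int, int]]   # inclusive (start, end) ranges


def parse_cluster_nodes(output: str) -> List[ClusterNode]:
    nodes = []
    for line in output.strip().splitlines():
        fields = line.split()
        slots = []
        for token in fields[8:]:
            if token.startswith("["):
                continue  # slot in migrating/importing state, ignored here
            start, _, end = token.partition("-")
            slots.append((int(start), int(end or start)))
        nodes.append(ClusterNode(
            node_id=fields[0],
            address=fields[1].split("@")[0],
            flags=fields[2].split(","),
            master_id=None if fields[3] == "-" else fields[3],
            slots=slots,
        ))
    return nodes


# Three of the lines from the CLUSTER NODES output above.
SAMPLE = """
fdf56116c8b8f322561c7189574e6092101fa718 172.19.197.2:6379@16379 myself,master - 0 1608418118000 12 connected 0-5460
164dc6aaf77aa0530490f0c9fbf5c8eb9f653a53 172.19.197.5:6379@16379 slave fdf56116c8b8f322561c7189574e6092101fa718 0 1608418118557 12 connected
88875e065f5ecf24b5adde973223a7799aee4521 172.19.197.6:6379@16379 master - 0 1608418118963 11 connected 5461-10922
"""

nodes = parse_cluster_nodes(SAMPLE)
myself = next(node for node in nodes if "myself" in node.flags)
replicas = [node.address for node in nodes if node.master_id == myself.node_id]
print(myself.address, myself.slots, replicas)
```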

At this point, you probably have more questions in your head than when you started reading this post, which is not good :) So, I am going to try to guess what those questions are, and answer at least some of them proactively.

However, note that the Redis Cluster Specification already does a pretty good job on the details. With that in mind, my aim is not to duplicate that documentation here. That said, I still want to highlight the most impactful parts that are valuable to focus on, based on my own experience working with Redis Cluster.

Key Distribution

This section is essentially all about answering our first question above regarding which node holds which data. Redis has an interesting way of making this work, which has worked well for the use cases I have experience with. Here is a very high level summary of how it works:

Redis assigns "slot" ranges to each master node within the cluster. These slots are also referred to as "hash slots".
These slots are numbered from 0 to 16383, which means each master node in a cluster handles a subset of the 16384 hash slots.
Redis clients can query which node is assigned to which slot range by using the CLUSTER SLOTS command. This gives clients a way to be able to directly talk to the correct node for the majority of cases.
For a given Redis key, the hash slot for that key is the result of CRC16(key) modulo 16384, where CRC16 here is the implementation of the CRC16 hash function. I am no expert when it comes to cryptography and hashing, but here is how this can be done in Go by using the snksoft/crc library. Note that Redis also has a handy command called CLUSTER KEYSLOT which performs this operation for you for a given Redis key. The clients are expected to embed this logic so that they can directly communicate with the correct node with the help of the CLUSTER SLOTS command mentioned above.
Same as the single node Redis setup, Redis Cluster uses asynchronous replication between nodes. So, each shard can have its own set of replicas which would be responsible for the same subset of the hash slots as its master. These replicas can be used for failover scenarios as well as distributing the read load (which we will touch on later).
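
As a rough illustration of the CRC16(key) modulo 16384 rule, here is a small self-contained Python sketch. It assumes the XMODEM variant of CRC16 (polynomial 0x1021, initial value 0), which is the variant the Redis Cluster specification describes, and it deliberately ignores hash tags. The function names are my own, not part of any Redis client.

```python
def crc16_xmodem(data: bytes) -> int:
    """Bitwise CRC16, XMODEM variant: polynomial 0x1021, initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Hash slot for a key, ignoring hash tags for simplicity."""
    return crc16_xmodem(key.encode("utf-8")) % 16384


# Reference value from the Redis Cluster specification: the CRC16 of
# "123456789" is 0x31C3.
print(hex(crc16_xmodem(b"123456789")))
print(key_slot("coffee_shop_branch.status.7"))
```

The result of key_slot for a key without curly braces should agree with what CLUSTER KEYSLOT reports for the same key. A real client would typically use a table-driven CRC16 for speed; the bitwise loop is only here because it is short.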

For example, if you have a setup of 3 master nodes with each having 3 replicas, it would look something like the following:

The specific ranges of the hash slots don't matter too much here, nor does whether they are balanced fairly (as we will touch on later, we can have influence over slot allocation if we need to). What matters is that it's clear which master node owns which hash slots.

As an example, I have a local Redis cluster setup which has 3 master nodes, and I am connected to one of them (172.19.197.2) through redis-cli. When I run the CLUSTER SLOTS command, I can see that the node I am connected to handles hash slot range between 0 and 5460:

172.19.197.2:6379> CLUSTER SLOTS
...
...
2) 1) (integer) 0
   2) (integer) 5460
   3) 1) "172.19.197.2"
      2) (integer) 6379
      3) "fdf56116c8b8f322561c7189574e6092101fa718"
   4) 1) "172.19.197.5"
      2) (integer) 6379
      3) "164dc6aaf77aa0530490f0c9fbf5c8eb9f653a53"
...
...
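
Once a client has the slot ranges, finding the node for a given slot is a simple range lookup. Here is a minimal Python sketch of that lookup. The first range comes from the CLUSTER SLOTS output above, and the other two are taken from the CLUSTER NODES output earlier in this post; node_for_slot is a hypothetical helper, not a function from a Redis client library.

```python
from bisect import bisect_right

# (first slot, last slot, master address), sorted by first slot.
SLOT_RANGES = [
    (0, 5460, "172.19.197.2:6379"),
    (5461, 10922, "172.19.197.6:6379"),
    (10923, 16383, "172.19.197.7:6379"),
]
RANGE_STARTS = [start for start, _, _ in SLOT_RANGES]


def node_for_slot(slot: int) -> str:
    """Return the address of the master node that owns the given hash slot."""
    if not 0 <= slot <= 16383:
        raise ValueError(f"slot out of range: {slot}")
    # Find the last range whose first slot is <= the requested slot.
    index = bisect_right(RANGE_STARTS, slot) - 1
    _, end, address = SLOT_RANGES[index]
    if slot > end:
        raise LookupError(f"no node owns slot {slot}")
    return address


print(node_for_slot(717))    # 172.19.197.2:6379
print(node_for_slot(8715))   # 172.19.197.6:6379
print(node_for_slot(12974))  # 172.19.197.7:6379
```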

I want to set 4 keys, which I already know fall into the slot range of this node:

172.19.197.2:6379> CLUSTER KEYSLOT coffee_shop_branch.status.7
(integer) 717
172.19.197.2:6379> CLUSTER KEYSLOT coffee_shop_branch.status.6
(integer) 4844
172.19.197.2:6379> CLUSTER KEYSLOT coffee_shop_branch.status.2
(integer) 4712
172.19.197.2:6379> CLUSTER KEYSLOT coffee_shop_branch.status.3
(integer) 585
172.19.197.2:6379> SET coffee_shop_branch.status.7 PERMANENTLY-CLOSED
OK
172.19.197.2:6379> SET coffee_shop_branch.status.6 PERMANENTLY-CLOSED
OK
172.19.197.2:6379> SET coffee_shop_branch.status.2 OPEN
OK
172.19.197.2:6379> SET coffee_shop_branch.status.3 CLOSED
OK
172.19.197.2:6379> KEYS *
1) "coffee_shop_branch.status.7"
2) "coffee_shop_branch.status.6"
3) "coffee_shop_branch.status.2"
4) "coffee_shop_branch.status.3"

I can also successfully read these the same way I would have done with a single node Redis setup:

172.19.197.2:6379> GET coffee_shop_branch.status.7
"PERMANENTLY-CLOSED"
172.19.197.2:6379> GET coffee_shop_branch.status.6
"PERMANENTLY-CLOSED"
172.19.197.2:6379> GET coffee_shop_branch.status.2
"OPEN"
172.19.197.2:6379> GET coffee_shop_branch.status.3
"CLOSED"

Hash Tags: Getting back into control of your sharding strategy

In certain cases, we would like to influence which node our data is stored at. This is to be able to group certain keys together so that we can later access them together through a multi-key operation, or through pipelining.

One use case here would be to satisfy the access pattern of retrieving the status of multiple coffee shops within the same city, where we don't have a way to group these together during write time. Therefore, it makes sense to write the status of each coffee shop under their individual keys, and access the ones that we care about through pipelining, or MGET.

⚠️ I am mentioning MGET as an option here as it is technically a viable option. However, keep in mind that MGET blocks other clients till the whole read operation completes, whereas pipelining doesn't since it's just a way of batching commands. Although you may not see the difference with just a few keys, it's not a good idea to use MGET for too many keys. I suggest you perform your own benchmarks for your own use case to see what the threshold might be here.

The idea is solid, but there is still a question: how can we make sure that coffee shops under the same city are co-located within the same node? For example, if we also have the coffee shops with ID 1 and 4, they are not going to be stored within the same node as the coffee shops with ID 2, 3, 6 and 7 based on our current setup (remember: the node at 172.19.197.2 is responsible for the hash slot range of 0-5460):

172.19.197.2:6379> CLUSTER KEYSLOT coffee_shop_branch.status.1
(integer) 8715
172.19.197.2:6379> CLUSTER KEYSLOT coffee_shop_branch.status.4
(integer) 12974

172.19.197.2:6379> CLUSTER KEYSLOT coffee_shop_branch.status.2
(integer) 4712
172.19.197.2:6379> CLUSTER KEYSLOT coffee_shop_branch.status.3
(integer) 585
172.19.197.2:6379> CLUSTER KEYSLOT coffee_shop_branch.status.6
(integer) 4844
172.19.197.2:6379> CLUSTER KEYSLOT coffee_shop_branch.status.7
(integer) 717

You can see that Redis will also complain when we try to MGET all of these keys:

172.19.197.2:6379> MGET coffee_shop_branch.status.1 coffee_shop_branch.status.2 coffee_shop_branch.status.3 coffee_shop_branch.status.4 coffee_shop_branch.status.6 coffee_shop_branch.status.7
(error) CROSSSLOT Keys in request don't hash to the same slot

We can see the same behavior even if we remove coffee_shop_branch.status.1 and coffee_shop_branch.status.4 from the list of keys. This is because MGET can only succeed if all of the keys belong to the same slot, as the error message suggests. Being on the same node is not enough.

172.19.197.2:6379> MGET coffee_shop_branch.status.2 coffee_shop_branch.status.3 coffee_shop_branch.status.6 coffee_shop_branch.status.7
(error) CROSSSLOT Keys in request don't hash to the same slot
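
The check behind this error can be sketched in a few lines of Python. The slot numbers below are copied from the CLUSTER KEYSLOT output above, and same_slot is a hypothetical helper that only mirrors the rule, not how Redis implements it: a multi-key command is accepted only when every key maps to one and the same slot.

```python
# Slots as reported by CLUSTER KEYSLOT in the session above.
KEY_SLOTS = {
    "coffee_shop_branch.status.2": 4712,
    "coffee_shop_branch.status.3": 585,
    "coffee_shop_branch.status.6": 4844,
    "coffee_shop_branch.status.7": 717,
}


def same_slot(keys, key_slots) -> bool:
    """Return True if all keys hash to a single slot."""
    return len({key_slots[key] for key in keys}) <= 1


# All four keys live on the node owning slots 0-5460, yet they sit in four
# different slots, so a multi-key command across them is rejected.
print(same_slot(list(KEY_SLOTS), KEY_SLOTS))                        # False
print(same_slot(["coffee_shop_branch.status.2"], KEY_SLOTS))        # True
```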

This is where the concept of hash tags comes in. Hash tags allow us to force certain keys to be stored in the same hash slot. I encourage you to read the linked section of the spec to better understand how hash tags work, as I am going to skip some corner cases here. In a nutshell, the concept is really simple from the usage point of view: when the Redis key contains a "{...}" pattern, only the substring between { and } is hashed in order to obtain the hash slot.

For our use case, this means that we can change our key structure from coffee_shop_branch.status.COFFEE-SHOP-ID to something like coffee_shop_branch.{city_CITY-ID}.status.COFFEE-SHOP-ID. The exact shape of the key is not important here. What's important is the value between the curly braces, which is the city ID prefixed with city_ for readability purposes.

For the example that we have been working with, let's assume that the coffee shops with ID 1, 4, 2, 3, 6 and 7 are all within the same city, say the city with ID 4. The keys will then shape up as follows, and we can see from the CLUSTER KEYSLOT command outcome that all of these keys are hashed to the same slot:
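
To show which part of a key ends up being hashed, here is a minimal Python sketch of the hash tag rule, including the corner cases I know of from the spec (an empty {} is ignored, and only the first { and the first } after it count). hashed_part is a name I made up; the result is what would then be fed into CRC16.

```python
def hashed_part(key: str) -> str:
    """Return the part of the key that is hashed to obtain the hash slot."""
    start = key.find("{")
    if start == -1:
        return key  # no opening brace: hash the whole key
    end = key.find("}", start + 1)
    if end == -1 or end == start + 1:
        return key  # no closing brace, or nothing between the braces
    return key[start + 1:end]


print(hashed_part("coffee_shop_branch.{city_4}.status.1"))  # city_4
print(hashed_part("coffee_shop_branch.{city_4}.status.7"))  # city_4
print(hashed_part("coffee_shop_branch.status.7"))           # whole key
```

Since both tagged keys reduce to city_4 before hashing, they are guaranteed to land in the same slot regardless of the coffee shop ID.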

172.19.197.2:6379> CLUSTER KEYSLOT coffee_shop_branch.{city_4}.status.1
(integer) 1555
172.19.197.2:6379> CLUSTER KEYSLOT coffee_shop_branch.{city_4}.status.4
(integer) 1555
172.19.197.2:6379> CLUSTER KEYSLOT coffee_shop_branch.{city_4}.status.2
(integer) 1555
172.19.197.2:6379> CLUSTER KEYSLOT coffee_shop_branch.{city_4}.status.3
(integer) 1555
172.19.197.2:6379> CLUSTER KEYSLOT coffee_shop_branch.{city_4}.status.6
(integer) 1555
172.19.197.2:6379> CLUSTER KEYSLOT coffee_shop_branch.{city_4}.status.7
(integer) 1555

We can also see that MGET will start working as expected with these keys:

172.19.197.2:6379> MGET coffee_shop_branch.{city_4}.status.1 coffee_shop_branch.{city_4}.status.4 coffee_shop_branch.{city_4}.status.2 coffee_shop_branch.{city_4}.status.3 coffee_shop_branch.{city_4}.status.6 coffee_shop_branch.{city_4}.status.7
1) "OPEN"
2) "CLOSED"
3) "OPEN"
4) "CLOSED"
5) "PERMANENTLY-CLOSED"
6) "PERMANENTLY-CLOSED"

So, hash tags are great, and we should use them all the time, right? Not so fast! This approach can make a notable positive impact on the latency of your application, and on the resource utilization of your Redis nodes. However, there is a drawback here which might be a big worry for you depending on your load and data distribution: the Hot Shard problem (a.k.a. Hot Key problem). In our use case for instance, this can be a significant problem when certain cities hold way more coffee shops than others, or when the access rate for certain cities is significantly higher even if the data sizes are the same. I will leave this super informative post from 2010 here, which is about one of the Foursquare outages. You will quickly realise after reading the post-mortem that it was caused by the exact same problem.

Hash tags are a tool that can help you, but there is unfortunately no magic bullet here. You need to understand your use case and data distribution, and test different setups to understand what might work best for you.

Redirection

Apart from the MGET example above, we have been playing it by the rules so far: knowingly issuing commands against the nodes that actually hold the data for the given keys. We were able to do this through the couple of cluster commands that Redis provides such as CLUSTER SLOTS and CLUSTER KEYSLOT.

What would happen if we do the opposite though: issuing a command against a Redis node which doesn't actually own the hash slot for the given key? Here is the answer:

172.19.197.2:6379> get coffee_shop_branch.status.1
(error) MOVED 8715 172.19.197.6:6379

Redis is erroring, but in a more clever way than you might have guessed. The error itself includes the hash slot of the key, and the ip:port of the instance that owns that hash slot and can serve the query. This is called MOVED redirection in the Redis Cluster spec, and all Redis Cluster clients are expected to handle this error appropriately so that they can eventually complete the request by connecting to the correct node and issuing the command there.

redis-cli, being one of the Redis clients, also knows how to handle MOVED redirection. The CLI utility implements basic cluster support when started with the -c switch.
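
As a rough illustration of what a client has to pull out of that error, the following Go sketch parses the error text into its two parts. It only relies on the `MOVED <slot> <ip:port>` shape shown above; the `redirect` type and `parseMoved` function are hypothetical names used for this example, and a real client library will have its own way of doing this.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// redirect holds the two pieces of information carried by a MOVED error.
type redirect struct {
	Slot int
	Addr string
}

// parseMoved is a hypothetical helper which splits "MOVED <slot> <ip:port>".
func parseMoved(msg string) (redirect, error) {
	fields := strings.Fields(msg)
	if len(fields) != 3 || fields[0] != "MOVED" {
		return redirect{}, errors.New("not a MOVED redirection")
	}
	slot, err := strconv.Atoi(fields[1])
	if err != nil {
		return redirect{}, fmt.Errorf("invalid hash slot %q: %w", fields[1], err)
	}
	return redirect{Slot: slot, Addr: fields[2]}, nil
}

func main() {
	r, err := parseMoved("MOVED 8715 172.19.197.6:6379")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("slot %d is served by %s\n", r.Slot, r.Addr)
}
```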

➜ docker run -it --rm \
    --net redis-cluster_redis_cluster_network \
    redis \
    redis-cli -h redis_1
redis_1:6379> get coffee_shop_branch.status.1
(error) MOVED 8715 172.19.197.6:6379
redis_1:6379> exit

➜ docker run -it --rm \
    --net redis-cluster_redis_cluster_network \
    redis \
    redis-cli -c -h redis_1
redis_1:6379> get coffee_shop_branch.status.1
-> Redirected to slot [8715] located at 172.19.197.6:6379
"OPEN"
172.19.197.6:6379>

You can see that in the first case, when we connected to a Redis node through redis-cli without the -c switch, we got the MOVED redirection. However, in the case where we used the -c switch, the client handled the redirection transparently by connecting to the given Redis node, and issuing the command there.

However, Redis already gives us a way to identify which master node is responsible for which hash slot range, and Redis Cluster clients should also be able to compute the hash slot of a given key to figure out which node to connect to. So, why is this feature useful? There are two main reasons that I am aware of:

The first one is that the Redis Cluster specification doesn't require Redis Cluster clients to be clever about routing, meaning that clients don't need to keep track of which master node serves which hash slot range. Instead, a client only needs the logic to handle redirection to be considered a complete Redis Cluster client. I don't know exactly what the reason for this was, but I presume it made it easier for existing Redis clients to be adapted into Redis Cluster clients at the time. That said, these clients have a major drawback: they are much less efficient than their clever counterparts, since they have a high chance of making at least twice as many requests as they need to for the majority of the operations they perform.

Another reason why we have MOVED redirection in place (probably the most important one) is related to resharding. For instance, when a new master node is added to the Redis Cluster to offload some of the pressure from the existing nodes, the cluster is expected to be reconfigured so that certain hash slot ranges move from the existing nodes to the new node. This triggers what is commonly known as a resharding operation, and Redis aims to handle it without causing a disruption. However, while certain hash slot ranges are being moved from one node to another, there is a chance that the client holds stale information about the cluster. This might cause the client to connect to the old node which used to be responsible for a given hash slot, instead of the node which took charge of that slot after the client last retrieved the state of the cluster. This is where MOVED redirection comes in handy, and it also hints to the client to reload its cluster configuration.

I am aware that we haven't touched on resharding in depth yet (and we won't in this post), but redirection is such a fundamental concept of the Redis Cluster specification that I wanted to go over it briefly at a high level. Also note that there is another type of redirection known as ASK redirection. We won't be covering it here at all since it's fundamentally related to resharding, and it really deserves its own post.
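
The stale-cache scenario can be simulated without a real cluster. The sketch below is a toy model in Go, not how any particular client is implemented: `fakeCluster` and `cachingClient` are made-up types, the cluster is just a couple of in-memory maps, and the client simply overwrites its cached owner for the slot when it sees a MOVED error.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeCluster is an in-memory stand-in for a Redis Cluster: it only knows
// which address currently owns each hash slot, and the value stored per key.
type fakeCluster struct {
	owners map[int]string
	data   map[string]string
}

// get answers like a cluster node would: with the value when addr owns the
// slot, and with a MOVED error naming the current owner otherwise.
func (c *fakeCluster) get(addr string, slot int, key string) (string, error) {
	if owner := c.owners[slot]; owner != addr {
		return "", fmt.Errorf("MOVED %d %s", slot, owner)
	}
	return c.data[key], nil
}

// cachingClient keeps its own, possibly stale, slot-to-node table.
type cachingClient struct {
	cluster  *fakeCluster
	slots    map[int]string
	requests int
}

// get follows at most one redirection and remembers the new owner.
func (cl *cachingClient) get(slot int, key string) (string, error) {
	addr := cl.slots[slot]
	for attempt := 0; attempt < 2; attempt++ {
		cl.requests++
		value, err := cl.cluster.get(addr, slot, key)
		if err == nil {
			return value, nil
		}
		var movedSlot int
		var movedAddr string
		if _, scanErr := fmt.Sscanf(err.Error(), "MOVED %d %s", &movedSlot, &movedAddr); scanErr != nil {
			return "", err
		}
		cl.slots[movedSlot] = movedAddr // refresh the stale entry
		addr = movedAddr
	}
	return "", errors.New("too many redirections")
}

func main() {
	const slot = 8715
	const key = "coffee_shop_branch.status.1"
	cluster := &fakeCluster{
		owners: map[int]string{slot: "172.19.197.6:6379"},
		data:   map[string]string{key: "OPEN"},
	}
	// The client still believes that the slot lives on the old node.
	client := &cachingClient{cluster: cluster, slots: map[int]string{slot: "172.19.197.2:6379"}}

	value, _ := client.get(slot, key)
	fmt.Println(value, client.requests) // OPEN 2 (one extra request for the redirection)
	value, _ = client.get(slot, key)
	fmt.Println(value, client.requests) // OPEN 3 (the cached owner is now correct)
}
```

The first lookup costs two requests, and every lookup after that costs one, which is the difference between a client that only follows redirections and one that also keeps its slot table up to date.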

Distributing Reads

The last point I want to touch on is scaling reads, where we can make use of the replicas to distribute the load. For example, with the setup that we have been working with in this post, we have one replica per master node. Considering we have 3 master nodes, by default only 3 nodes serve reads and writes. However, we can utilize the replicas to serve read commands, which would essentially double the number of nodes that can serve reads.

This is great, but it comes at the cost of data consistency since Redis uses asynchronous replication by default, unless you use the WAIT command to make a write wait for acknowledgement from the replicas.

Let's assume that we are OK with the data inconsistency, and we are monitoring the replication lag. How can we utilize these replicas for reads? We can start by exploring this through redis-cli. From our previous exploration, we know that the node at 172.19.197.5:6379 is the replica of the node at 172.19.197.2:6379. So, let's connect to that node directly, and issue a GET command there:

➜ docker run -it --rm \
    --net redis-cluster_redis_cluster_network \
    redis \
    redis-cli -c -h 172.19.197.5
172.19.197.5:6379> get coffee_shop_branch.{city_4}.status.4
-> Redirected to slot [1555] located at 172.19.197.2:6379
"CLOSED"
172.19.197.2:6379>

That's a surprising outcome: we were redirected to the node at 172.19.197.2:6379, which is the master of the replica that we were connected to. From this, it seems like the replica either doesn't hold the data that we need, or doesn't allow any read operations.

Let's first check whether it actually holds the data. Looking at the KEYS stored at that node, it seems like it has the data that we need:

172.19.197.5:6379> KEYS *
 1) "coffee_shop_branch.status.3"
 2) "coffee_shop_branch.status.6"
 3) "coffee_shop_branch.status.7"
 4) "coffee_shop_branch.{city_4}.status.2"
 5) "coffee_shop_branch.{city_4}.status.4"
 6) "coffee_shop_branch.status.2"
 7) "coffee_shop_branch.{city_4}.status.7"
 8) "coffee_shop_branch.{city_4}.status.3"
 9) "coffee_shop_branch.{city_4}.status.6"
10) "coffee_shop_branch.{city_4}.status.1"

When we check the replica status, we can also see that the replica is up-to-date:

172.19.197.5:6379> INFO replication
# Replication
role:slave
master_host:172.19.197.2
master_port:6379
master_link_status:up
master_last_io_seconds_ago:8
master_sync_in_progress:0
...
...

It seems like the replica doesn't allow us to perform any read operations. This is expected, and it is documented in the Redis Cluster spec:

Normally slave nodes will redirect clients to the authoritative master for the hash slot involved in a given command, however clients can use slaves in order to scale reads using the READONLY command.

The READONLY command enables read queries for a connection to a Redis Cluster replica node. It hints to the server that the client is OK with the potential data inconsistency. The command needs to be sent on each connection to a replica node, ideally right after the connection is established.

➜ docker run -it --rm \
    --net redis-cluster_redis_cluster_network \
    redis \
    redis-cli -c -h 172.19.197.5
172.19.197.5:6379> READONLY
OK
172.19.197.5:6379> get coffee_shop_branch.{city_4}.status.4
"CLOSED"
172.19.197.5:6379>

To be honest, I remember that this behavior threw me off when I first came across it. However, it makes a certain amount of sense to be explicit when it comes to reading stale data. My only gripe is the name of the command, which is somewhat confusing. That said, you get used to it after a while, and it's well supported by the clients (e.g. the go-redis client has a way for you to configure this, as well as the replica routing behavior).

Conclusion

Redis Cluster gives us the ability to scale our Redis setup horizontally not just for reads but also for writes, and you should consider it especially if you have a write-heavy workload where you cannot easily predict the demand ahead of time. The sharding model Redis offers is also very interesting, as it mixes client-level and server-level logic about where your data is and how to find it. This gives us an easy way to get started with a rudimentary sharding setup, while allowing us to optimize our system further by making our clients a bit more clever.

I am aware that there are still further unknowns in terms of how to actually initialize a Redis Cluster setup from scratch, the details of how clients interact with a Redis Cluster setup, how the maintenance/operational side of the cluster setup actually works (e.g. resharding), etc. However, this post is already too long (there you go, my excuse!), and I hope to cover those in upcoming posts one by one. If there are any specific areas of Redis Cluster that you are wondering about, drop a comment below and I will try to cover them if I have any experience around those areas.

Resources

Working with Slices in Go (Golang) - Understanding How append, copy and Slice Expressions Work

Tugberk Ugurlu — Sat, 12 Sep 2020 16:55:00 +0000

Content

Introduction

How Slices Work

How append and copy Works

How Slice Expressions Work

⚠️ Modifying a Sliced-slice Modifies the Original Slice

⚠️ Calling append on a Sliced-slice May Modify the Original Slice

Introduction

The Go programming language has two fundamental types at the language level for working with a numbered sequence of elements: array and slice. At the syntax level they may look alike, but they are fundamentally very different in terms of their behavior. The most critical differences are:

Size of the array is fixed and determined at the construction (as you may expect). However, slices can dynamically grow in size (you may wonder how? We will touch on this soon, be patient!)
An array with a specific length is a distinct type based on its length (check this out). Whereas the slice can be represented as one type (e.g. []int)
The in-memory representation of an array type is values laid out sequentially. A slice is a descriptor of an array segment. It consists of a pointer to the array, the length of the segment, and its capacity (we will shortly see what this actually means).
Go's arrays are values, which means that the entire content of the array is copied when you pass it around. A slice, on the other hand, holds a pointer to the underlying array along with the length and capacity of the segment. So, when we pass a slice around, a new slice value is created that points to the original array, which is much cheaper to pass around.

The points above highlight some characteristics of slices and how they differ from arrays, but these are mostly differences in how they are structured. The more interesting and less obvious differences are in how slices behave when they are manipulated.

How Slices Work

To understand how slices work, we first need a good understanding of how arrays work in Go. You can check out this informative description to learn more about arrays than the above summary covers.

Slices are the constructs in Go which give us the flexibility to work with dynamically sized collections. A slice is an abstraction of an array, and it points to a contiguous section of an array stored separately from the slice variable itself. Internally, a slice is a descriptor which holds the following values:

pointer to the backing array (actually, pointer to the array value which indicates 0th index of the slice, which we will cover later)
the length of the segment it's referring to
its capacity (the maximum length of the segment)

There are various ways to define a slice in Go, and all of the following lead to the same outcome: a slice with zero length and capacity.

func main() {
	var a []int 
	b := []int{}
	c := make([]int, 0)
	fmt.Printf("a: %v, len %d, cap: %d\n", a, len(a), cap(a))
	fmt.Printf("b: %v, len %d, cap: %d\n", b, len(b), cap(b))
	fmt.Printf("c: %v, len %d, cap: %d\n", c, len(c), cap(c))
}

a: [], len 0, cap: 0
b: [], len 0, cap: 0
c: [], len 0, cap: 0

You can also initialize a slice with seed values, and the length of the values here will also be the capacity of the backing array:

func main() {
	a := []int{1,2,3}
	fmt.Printf("a: %v, len %d, cap: %d\n", a, len(a), cap(a))
}

a: [1 2 3], len 3, cap: 3

In case you know the maximum size that a slice can grow to, it's best to initialize the slice with a capacity hint so that the backing array doesn't have to grow as you add new values to the slice (we will see how in the next section). You can do so by passing the capacity to the make builtin function.

func main() {
	a := make([]int, 0, 10)
	fmt.Printf("a: %v, len %d, cap: %d\n", a, len(a), cap(a))
}

a: [], len 0, cap: 10

This still doesn't mean that you can access the backing array freely by index, as the length is still 0. If you attempt to do so, you will get an "index out of range" runtime error.
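
Here is a small sketch of that length-versus-capacity distinction. The `readAt` helper is a made-up name which turns the runtime panic into an error, purely so that the example can keep running:

```go
package main

import "fmt"

// readAt returns s[i] and converts an out-of-range panic into an error.
func readAt(s []int, i int) (v int, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	return s[i], nil
}

func main() {
	a := make([]int, 0, 10)

	// The capacity is 10, but the length is 0, so index 0 is out of range.
	_, err := readAt(a, 0)
	fmt.Println(err)

	// After an append, the length is 1 and index 0 becomes readable.
	a = append(a, 42)
	v, err := readAt(a, 0)
	fmt.Println(v, err) // 42 <nil>

	// Reslicing up to the capacity is allowed, unlike indexing past the length.
	b := a[:cap(a)]
	fmt.Println(len(b), cap(b)) // 10 10
}
```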

How append and copy Works

When we want to add a new value to an existing slice which will mean growing its length, we can use append, which is a built-in and variadic function. This function appends elements to the end of a slice, and returns the updated slice.

func main() {
	var result []int
	for i := 0; i < 10; i++ {
		if i % 2 == 0 {
			result = append(result, i)
		}
	}
	fmt.Println(result)
}

As you may expect, this prints [0 2 4 6 8] to the console as a result. However, it's not clear here what exactly is happening underneath as a result of invocation of the append function, and what the time complexity of the call is. When we run the below code, things will be a bit more clear to us:

package main

import (
	"fmt"
	"reflect"
	"unsafe"
)

func main() {
	var result []int
	for i := 0; i < 10; i++ {
		if i % 2 == 0 {
			fmt.Printf("appending '%d': %s\n", i, getSliceHeader(&result))
			result = append(result, i)
			fmt.Printf("appended '%d':  %s\n", i, getSliceHeader(&result))
		}
	}
	fmt.Println(result)
}

// https://stackoverflow.com/a/54196005/463785
func getSliceHeader(slice *[]int) string {
	sh := (*reflect.SliceHeader)(unsafe.Pointer(slice))
	return fmt.Sprintf("%+v", sh)
}

appending '0': &{Data:0 Len:0 Cap:0}
appended '0':  &{Data:824633901184 Len:1 Cap:1}
appending '2': &{Data:824633901184 Len:1 Cap:1}
appended '2':  &{Data:824633901296 Len:2 Cap:2}
appending '4': &{Data:824633901296 Len:2 Cap:2}
appended '4':  &{Data:824633803136 Len:3 Cap:4}
appending '6': &{Data:824633803136 Len:3 Cap:4}
appended '6':  &{Data:824633803136 Len:4 Cap:4}
appending '8': &{Data:824633803136 Len:4 Cap:4}
appended '8':  &{Data:824634228800 Len:5 Cap:8}
[0 2 4 6 8]

We can extract the following facts from this result:

A nil slice starts off with zero capacity, nothing surprising there
The capacity of the slice doubles when we append a new item while its capacity and length are equal
When the capacity is doubled, we can also observe that the pointer to the backing array (i.e. the Data field value of the reflect.SliceHeader struct) changes

In summary, it's fair to assume from these facts that when we attempt to append a new item to a slice whose capacity is full, the content of its backing array is copied into a new array with double the capacity. It should go without saying that the implementation is a bit more complicated than this, and this post from Gary Lu does a good job of explaining the implementation in more detail. You can also check out the growslice function in the Go runtime, which is called by the compiler-generated code to grow the capacity of the slice when needed.

In a nutshell, this is not good news for us since we are doing more work than necessary. In these cases, initializing the slice with the make built-in function is a far better option, with a capacity hint based on the maximum capacity that the slice can grow to:

package main

import (
	"fmt"
	"reflect"
	"unsafe"
)

func main() {
	maxValue := 10
	result := make([]int, 0, maxValue)
	for i := 0; i < maxValue; i++ {
		if i % 2 == 0 {
			fmt.Printf("appending '%d': %s\n", i, getSliceHeader(&result))
			result = append(result, i)
			fmt.Printf("appended '%d':  %s\n", i, getSliceHeader(&result))
		}
	}
	fmt.Println(result)
}

// https://stackoverflow.com/a/54196005/463785
func getSliceHeader(slice *[]int) string {
	sh := (*reflect.SliceHeader)(unsafe.Pointer(slice))
	return fmt.Sprintf("%+v", sh)
}

appending '0': &{Data:824633794640 Len:0 Cap:10}
appended '0':  &{Data:824633794640 Len:1 Cap:10}
appending '2': &{Data:824633794640 Len:1 Cap:10}
appended '2':  &{Data:824633794640 Len:2 Cap:10}
appending '4': &{Data:824633794640 Len:2 Cap:10}
appended '4':  &{Data:824633794640 Len:3 Cap:10}
appending '6': &{Data:824633794640 Len:3 Cap:10}
appended '6':  &{Data:824633794640 Len:4 Cap:10}
appending '8': &{Data:824633794640 Len:4 Cap:10}
appended '8':  &{Data:824633794640 Len:5 Cap:10}
[0 2 4 6 8]

We can observe from the result here that we have been operating over the same backing array with the size of 10, which means that all the append operations have run in O(1) time.
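
To put a number on the difference, the following sketch counts how often the capacity changes while appending, with and without a capacity hint. `countReallocations` is a hypothetical helper, and the exact count without the hint depends on the growth strategy of the Go version you run it with, so only the hinted case has a predictable result:

```go
package main

import "fmt"

// countReallocations appends n values to s and reports how many times the
// capacity changed, i.e. how often append moved to a new backing array.
func countReallocations(s []int, n int) int {
	reallocs := 0
	for i := 0; i < n; i++ {
		before := cap(s)
		s = append(s, i)
		if cap(s) != before {
			reallocs++
		}
	}
	return reallocs
}

func main() {
	const n = 1000
	// Starts from a nil slice, so the backing array has to grow several times.
	fmt.Println("without hint:", countReallocations(nil, n))
	// Starts with enough capacity, so no reallocation is needed at all.
	fmt.Println("with hint:   ", countReallocations(make([]int, 0, n), n)) // 0
}
```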

There is also another built-in function which makes it easy to transfer values from one slice to another: copy. I will quote the definition of copy straight from the Go spec:

"The function copy copies slice elements from a source src to a destination dst and returns the number of elements copied. Both arguments must have identical element type T and must be assignable to a slice of type []T. The number of elements copied is the minimum of len(src) and len(dst)."

It's probably obvious, but still worth mentioning that copy runs in O(N) time, where N is the number of elements it copies. The following example demonstrates the copy function in action:

func main() {
	a := make([]int, 5, 6)
	b := []int{1, 2, 3, 4, 5}
	fmt.Println(copy(a, b))
	fmt.Printf("a: %v, cap: %d\n", a, cap(a))
}

5
a: [1 2 3 4 5], cap: 6

How Slice Expressions Work

Slice expressions construct a substring or slice from a string, array, pointer to array, or slice (e.g. a[1:5]). The result has indices starting at 0 and length equal to high - low. This is great, as it gives us an easy way to perform slicing operations on the original value, and it is performed very efficiently. The reason is that slicing does not copy the slice's data. It creates a new slice value (i.e. a reflect.SliceHeader) that points to the original array (more precisely, to the element of the original array that becomes the first element of the new slice).

The following example should be able to demonstrate this behavior for us:

package main

import (
	"fmt"
	"reflect"
	"unsafe"
)

func main() {
	maxValue := 10
	result := make([]int, 0, maxValue)
	for i := 0; i < maxValue; i++ {
		if i % 2 == 0 {
			result = append(result, i)
		}
	}
	for i := range result {
		fmt.Printf("%d: %v\n", i, &result[i])
	}
	newSlice := result[1:3]
	newSlice2 := result[2:4]
	fmt.Printf("[:]: %s\n", getSliceHeader(&result))
	fmt.Printf("[1:3]: %s\n", getSliceHeader(&newSlice))
	fmt.Printf("[2:4]: %s\n", getSliceHeader(&newSlice2))
}

func getSliceHeader(slice *[]int) string {
	sh := (*reflect.SliceHeader)(unsafe.Pointer(slice))
	return fmt.Sprintf("%+v", sh)
}

Let's unpack what we are doing here:

we are creating a slice
filling it with data and printing the memory address of each value in the slice in hexadecimal (so that we can compare these later)
slicing it twice, and assigning the new slices to separate variables
inspecting the header values of each slice

The outcome is as below (you can also run it here):

0: 0xc000012050
1: 0xc000012058
2: 0xc000012060
3: 0xc000012068
4: 0xc000012070
[:]: &{Data:824633794640 Len:5 Cap:10}
[1:3]: &{Data:824633794648 Len:2 Cap:9}
[2:4]: &{Data:824633794656 Len:2 Cap:8}

As a result, we see that the [1:3] slice has a length of 2, which is expected. What's interesting is the capacity, which is 9. The capacity assigned to the sliced-slice depends on the starting point of the new slice (i.e. low) and the capacity of the original slice (cap), and is calculated as cap - low. In other words, the remaining capacity still refers to the same sequential memory addresses of the backing array. We will see in the next sections what the implications of this behavior can be.

The other interesting thing we see here is that the pointer to the backing array has changed. This is a result of how arrays are represented in memory: an array is stored as a sequence of n blocks of the specified type. So, the pointer here actually points to the value at index 1 of the original array, which we can confirm by comparing it against the memory addresses we printed for each value in the original slice: the [1:3] slice points to 824633794648, and the value at index 1 of the original slice lives at 0xc000012058, which is the hexadecimal representation of 824633794648.

The same goes for the [2:4] sliced-slice. What we can confirm from this is that slicing is very efficient, at the cost of sharing the backing array with the original slice.
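
The same observations can be checked without reaching for unsafe. The sketch below uses a made-up `describe` helper which reports the length and capacity of a sliced-slice, and whether its first element lives at the same address as the element at index low of the original slice:

```go
package main

import "fmt"

// describe slices base[low:high] and reports the length, the capacity, and
// whether the first element shares its address with base[low].
// It assumes low < high so that the new slice has a first element.
func describe(base []int, low, high int) (length, capacity int, shares bool) {
	s := base[low:high]
	return len(s), cap(s), &s[0] == &base[low]
}

func main() {
	result := make([]int, 5, 10)

	// The capacity is cap(result) - low in both cases.
	fmt.Println(describe(result, 1, 3)) // 2 9 true
	fmt.Println(describe(result, 2, 4)) // 2 8 true
}
```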

⚠️ Modifying a Sliced-slice Modifies the Original Slice

By looking at the internals of how slicing works, we have seen that the new slice returned by slicing an existing slice still refers to the same backing array as the original slice. This has a very interesting implication: modifying data at an index of the sliced-slice also modifies the original slice, which can cause bugs that are very hard to track down. The following code snippet shows how this can happen:

package main

import (
	"fmt"
)

func main() {
	a := []int{1, 2, 3, 4, 5}
	b := a[2:4]
	b[0] = 10
	fmt.Println(b)
	fmt.Println(a)
}

[10 4]
[1 2 10 4 5]

In this example, the issue may already be apparent to us. However, this unobvious behavior of how slicing works underneath (to be fair, for the right performance reasons) can make issues much harder to spot when the slicing and the modification are done in different places. For instance, in the following example (which you can also see here), the Result method on the race instance no longer returns the expected result due to the modifications made to the slice returned by the Top10Finishers method: the sort.Strings call modified the array which backs both slices.

package main

import (
	"fmt"
	"play.ground/race"
	"sort"
)

func main() {
	belgium2020Race := race.New("Belgian", []string{
		"Hamilton", "Bottas", "Verstappen", "Ricciardo", "Ocon",
		"Albon", "Norris", "Gasly", "Stroll", "Perez",
		"Kvyat", "Räikkönen", "Vettel", "Leclerc", "Grosjean",
		"Latifi", "Magnussen", "Giovinazzi", "Russell", "Sainz",
	})
	top10Finishers := belgium2020Race.Top10Finishers()
	fmt.Printf("%s GP top 10 finishers: %v\n", belgium2020Race.Name(), top10Finishers)
	sort.Strings(top10Finishers)
	fmt.Printf("%s GP top 10 finishers, in alphabetical order: %v\n", belgium2020Race.Name(), top10Finishers)
	fmt.Printf("%s GP result: %v\n", belgium2020Race.Name(), belgium2020Race.Result())
}

-- go.mod --
module play.ground

-- race/race.go --
package race

type race struct {
	name   string
	result []string
}

func (r race) Name() string {
	return r.name
}

func (r race) Result() []string {
	return r.result
}

func (r race) Top10Finishers() []string {
	return r.result[:10]
}

func New(name string, result []string) race {
	return race{
		name:   name,
		result: result,
	}
}

Belgian GP top 10 finishers: [Hamilton Bottas Verstappen Ricciardo Ocon Albon Norris Gasly Stroll Perez]
Belgian GP top 10 finishers, in alphabetical order: [Albon Bottas Gasly Hamilton Norris Ocon Perez Ricciardo Stroll Verstappen]
Belgian GP result: [Albon Bottas Gasly Hamilton Norris Ocon Perez Ricciardo Stroll Verstappen Kvyat Räikkönen Vettel Leclerc Grosjean Latifi Magnussen Giovinazzi Russell Sainz]

There is no one-size-fits-all solution to the problem here. It really depends on your usage, and on what type of contract you are exposing from your defined type. If you are after a domain model encapsulation where you don't want to allow uncontrolled modification of the state of that model, you can instead return a copy of the slice, at the cost of the extra time and space complexity you are introducing. The following code shows the only modification we need to make to the above example for this to work:

func (r race) Top10Finishers() []string {
	top10 := r.result[:10]
	result := make([]string, len(top10))
	copy(result, top10)
	return result
}

When we execute this version of the implementation, we can now see that the sort.Strings call is not implicitly modifying the original slice:

Belgian GP top 10 finishers: [Hamilton Bottas Verstappen Ricciardo Ocon Albon Norris Gasly Stroll Perez]
Belgian GP top 10 finishers, in alphabetical order: [Albon Bottas Gasly Hamilton Norris Ocon Perez Ricciardo Stroll Verstappen]
Belgian GP result: [Hamilton Bottas Verstappen Ricciardo Ocon Albon Norris Gasly Stroll Perez Kvyat Räikkönen Vettel Leclerc Grosjean Latifi Magnussen Giovinazzi Russell Sainz]

Another option here is to expose a read-only version of the data, which you can achieve by encapsulating the slice behind an interface. This would only allow certain read-only operations, and make it more obvious to the consumer of the package what the cost of the operation is:

type ReadOnlyStringCollection interface {
	Each(f func(i int, value string))
	Len() int
}

This forces your consumer to iterate over the data first before attempting to manipulate it, which is a positive in terms of establishing a much clearer contract for your package. The following shows how you can implement this inside the race package:

-- race/race.go --
package race

type race struct {
	name   string
	result []string
}

func (r race) Name() string {
	return r.name
}

func (r race) Result() []string {
	return r.result
}

func (r race) Top10Finishers() ReadOnlyStringCollection {
	return readOnlyStringCollection{r.result[:10]}
}

func New(name string, result []string) race {
	return race{
		name:   name,
		result: result,
	}
}

type readOnlyStringCollection struct {
	value []string
}

func (r readOnlyStringCollection) Each(f func(i int, value string)) {
	for i, v := range r.value {
		f(i, v)
	}
}

func (r readOnlyStringCollection) Len() int {
	return len(r.value)
}

type ReadOnlyStringCollection interface {
	Each(f func(i int, value string))
	Len() int
}

The following is how you can now make use of it:

func main() {
	belgium2020Race := race.New("Belgian", []string{
		"Hamilton", "Bottas", "Verstappen", "Ricciardo", "Ocon",
		"Albon", "Norris", "Gasly", "Stroll", "Perez",
		"Kvyat", "Räikkönen", "Vettel", "Leclerc", "Grosjean",
		"Latifi", "Magnussen", "Giovinazzi", "Russell", "Sainz",
	})
	top10Finishers := func() []string {
		result := make([]string, 10)
		top10 := belgium2020Race.Top10Finishers()
		top10.Each(func(i int, val string) {
			result[i] = val
		})
		return result
	}()
	fmt.Printf("%s GP top 10 finishers: %v\n", belgium2020Race.Name(), top10Finishers)	
	sort.Strings(top10Finishers)
	fmt.Printf("%s GP top 10 finishers, in alphabetical order: %v\n", belgium2020Race.Name(), top10Finishers)
	fmt.Printf("%s GP result: %v\n", belgium2020Race.Name(), belgium2020Race.Result())
}

When we execute this version of the implementation, we can now see that the sort.Strings call is not implicitly modifying the original slice in this case, too:

Belgian GP top 10 finishers: [Hamilton Bottas Verstappen Ricciardo Ocon Albon Norris Gasly Stroll Perez]
Belgian GP top 10 finishers, in alphabetical order: [Albon Bottas Gasly Hamilton Norris Ocon Perez Ricciardo Stroll Verstappen]
Belgian GP result: [Hamilton Bottas Verstappen Ricciardo Ocon Albon Norris Gasly Stroll Perez Kvyat Räikkönen Vettel Leclerc Grosjean Latifi Magnussen Giovinazzi Russell Sainz]

⚠️ Calling append on a Sliced-slice May Modify the Original Slice

We have previously gone over how the append function in Go works: it appends elements to the end of a slice and returns the updated slice. This can give you the impression that append is a pure function which doesn't modify your state. However, we have seen from how append works that this may not be the case. If we combine this with the fact that the new slice returned by slicing an existing slice still refers to the same backing array as the original slice, the following example demonstrates a behavior which is even less obvious than the one above:

func main() {
	a := []int{1, 2, 3, 4, 5}
	b := a[2:4]
	fmt.Printf("a: %v, cap: %d\n", a, cap(a))
	fmt.Printf("b: %v, cap: %d\n", b, cap(b))
	
	b = append(b, 20)
	fmt.Printf("b: %v, cap: %d\n", b, cap(b))
	fmt.Printf("a: %v, cap: %d\n", a, cap(a))
}

a: [1 2 3 4 5], cap: 5
b: [3 4], cap: 3
b: [3 4 20], cap: 3
a: [1 2 3 4 20], cap: 5

In this example, we have a slice assigned to variable a, and we slice it to assign a new slice to variable b. We print the results, and see that our sliced-slice has a length of 2 and a capacity of 3. All good, as expected. However, the interesting behavior kicks in when we append a value to slice b. The append works as expected, and slice b has the newly appended value. Besides that, we can see from the output that slice a has also been modified: the original value at index 4 (i.e. 5) is now replaced with 20, the value appended to slice b.

The reason for this behavior is exactly what we have talked about before: the remaining capacity of the sliced-slice is still being used by the original slice. To be honest, it would be unfair to call this unexpected behavior since it's documented in detail. Nevertheless, it's fair to classify it as entirely unobvious, especially if you are new to the language and expect the code to reveal a bit more of its behavior.

The only time you wouldn't see this behavior is when the sliced-slice's capacity is already full right after slicing:

func main() {
	a := []int{1, 2, 3, 4, 5}
	b := a[2:]
	fmt.Printf("a: %v, cap: %d\n", a, cap(a))
	fmt.Printf("b: %v, cap: %d\n", b, cap(b))
	
	b = append(b, 20)
	fmt.Printf("b: %v, cap: %d\n", b, cap(b))
	fmt.Printf("a: %v, cap: %d\n", a, cap(a))
}

a: [1 2 3 4 5], cap: 5
b: [3 4 5], cap: 3
b: [3 4 5 20], cap: 6
a: [1 2 3 4 5], cap: 5

In this example, we can see that both the capacity and the length of slice b were 3, so calling append on slice b triggered the grow logic: the values had to be copied to a new array with capacity 6 before the new value was appended. As a result, the original slice was not impacted by the modification.

This behavior also applies the other way around. For instance, take the following example:

package main

import (
	"fmt"
)

func main() {
	a := make([]int, 5, 6)
	copy(a, []int{1, 2, 3, 4, 5})
	fmt.Printf("a: %v, cap: %d\n", a, cap(a))

	b := a[3:]
	fmt.Printf("b: %v, cap: %d\n", b, cap(b))
	b = append(b, 10)
	fmt.Printf("b: %v, cap: %d\n", b, cap(b))

	a = append(a, 20)
	fmt.Printf("a: %v, cap: %d\n", a, cap(a))
	fmt.Printf("b: %v, cap: %d\n", b, cap(b))
}

a: [1 2 3 4 5], cap: 6
b: [4 5], cap: 3
b: [4 5 10], cap: 3
a: [1 2 3 4 5 20], cap: 6
b: [4 5 20], cap: 3

In this case, appending the value 20 to slice a causes the value at index 2 of slice b to be overwritten (it changes from 10 to 20).

Conclusion

The slice type in Go is a powerful construct, giving us flexibility over Go's array type with as small a performance hit as possible. This flexibility comes at the cost of modifications on slices having implicit consequences, which can be significant depending on the use case while also being very hard to track down. This post sheds some light on some of these, but I encourage you to spend time on understanding how slices really work in depth before making use of them in anger. Besides this post, the following resources should also help you

Implementing OrderedMap in Go 2.0 by Using Generics with Delete Operation in O(1) Time Complexity

Tugberk Ugurlu — Sat, 29 Aug 2020 11:33:00 +0000

Probably the most sought after feature of the Go programming language, Generics, is on its way and is expected to land with v2. You can check out the proposal here, and have a play with it in Go playground for v2. I stumbled upon rocketlaunchr.cloud's great post on using generics in Go 2, which shows how you can implement ordered maps. The post is very informative, and shows you how powerful Generics will be for Go.

However, I noticed a performance issue with the implementation of the Delete operation on the OrderedMap struct, and in this post, I want to show a much better implementation in terms of time complexity with a Doubly Linked List data structure, and show the impact of the change.

The Problem

To summarize the current approach in rocketlaunchr.cloud's post, it essentially exposes the below signature:

type OrderedMap[type K comparable, V any] struct {
	store map[K]V
	keys  []K
}

func (o *OrderedMap[K, V]) Get(key K) (V, bool) {
    // ...
}

func (o *OrderedMap[K, V]) Set(key K, val V) {
    // ...
}

func (o *OrderedMap[K, V]) Delete(key K) {
    // ...
}

func (o *OrderedMap[K, V]) Iterator() func() (*int, *K, V) {
    // ...
}

The implementation also gives the FIFO guarantee through the iterator, maintaining the order of the list even after a delete operation (e.g. if the map has 1,2,3,4, and 3 is then deleted, the iterator will output the data in the order 1,2,4).

The performance problem with the implementation is with the Delete method, which has the below implementation:

func (o *OrderedMap[K, V]) Delete(key K) {
	delete(o.store, key)

	// Find key in slice
	var idx *int

	for i, val := range o.keys {
		if val == key {
			idx = &[]int{i}[0]
			break
		}
	}
	if idx != nil {
		o.keys = append(o.keys[:*idx], o.keys[*idx+1:]...)
	}
}

The implementation here iterates over the keys slice to find the key to delete, which has O(N) time complexity. This can be a significant performance hit when there is a need to store large collections while making frequent use of the Delete operation. We can also observe that we are performing a shift in the keys slice with the below operation:

if idx != nil {
    o.keys = append(o.keys[:*idx], o.keys[*idx+1:]...)
}

Slicing with o.keys[:*idx] and o.keys[*idx+1:] is cheap since neither expression copies data, but append then has to copy every key after the deleted one a position to the left, which is O(N) in the worst case on its own. The backing array does not need to grow here, as the result is one element shorter than the original. That said, I have to admit that the language features here make it hard to reason about the exact time complexity of each operation at a glance.
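To make the shift visible, here is a small sketch that runs the same append-based removal on a plain []int and prints the original slice afterwards. The removeAt function name is made up for illustration.

```go
package main

import "fmt"

// removeAt mirrors the append-based delete shown above.
func removeAt(keys []int, idx int) []int {
	// The two slice expressions are O(1); append then copies every element
	// after idx one position to the left inside the same backing array.
	return append(keys[:idx], keys[idx+1:]...)
}

func main() {
	keys := []int{1, 2, 3, 4}
	out := removeAt(keys, 2)

	fmt.Println(out, len(out), cap(out)) // [1 2 4] 3 4
	fmt.Println(keys)                    // [1 2 4 4]: same array, shifted in place
}
```

The capacity stays at 4 and the original slice shows the shifted values, so no new array was allocated; the cost is purely the element copy.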

In the post, I realized that rocketlaunchr.cloud actually calls on readers to address this known issue, with the below quote :)

Currently, when you delete a key-value pair, you need to iterate over the keys slice to find the index of the key you want to delete. You can use another map that associates the key to the index in the keys slice. I’ll leave it to the reader to implement.

Thinking about this now, it won't actually be enough for us to store the index, as it will require us to either shift the slice where the keys are stored, or keep the storage for deleted keys and skip them during iteration. Both have their own disadvantages: the first option takes a time complexity hit on each delete operation, whereas the second one increases both the time complexity of the iterator and the space complexity.

Doubly Linked List Data Structure to Rescue

It's actually possible to reduce the time complexity of the Delete operation to O(1) by changing how we store the data within the implementation of the OrderedMap struct, without increasing the time complexity of other operations or needing to change any of its public signatures. We can do this by storing the ordered data in a Doubly Linked List data structure, and storing each node in the map as the value, instead of the raw value.

A Doubly Linked List data structure should be fairly straightforward to implement. However, Go already has one under the container/list package. The only caveat for us with this one is that there is no generic version of it in Go 2 at the moment, as far as I am aware. That said, we actually don't need a generic version since we will use it internally within the scope of OrderedMap, and we can store the value as interface{} instead.

Implementation

Let's look at the implementation, going through it step by step. Below we can see how we are changing the internal state storage constructs of the OrderedMap struct as well as its construction:

type OrderedMap[type K comparable, V any] struct {
	store map[K]*list.Element
	keys  *list.List
}

func NewOrderedMap[type K comparable, V any]() *OrderedMap[K, V] {
	return &OrderedMap[K, V]{
		store: map[K]*list.Element{},
		keys:  list.New(),
	}
}

The biggest changes to notice here are:

keys type has changed from []K to *list.List
map type has changed from map[K]V to map[K]*list.Element, where we are now storing the doubly linked list node, instead of the raw value

These changes to the internal storage will impact the implementation of all methods, but without increasing the time complexity or needing to change the public signature of each method. Below, for example, is how the Set, Get and Iterator method implementations have changed:

type keyValueHolder[type K comparable, V any] struct {
	key K
	value V
}

func (o *OrderedMap[K, V]) Set(key K, val V) {
	var e *list.Element
	if _, exists := o.store[key]; !exists {
		e = o.keys.PushBack(keyValueHolder[K, V]{
			key: key,
			value: val,
		})
	} else {
		e = o.store[key]
		e.Value = keyValueHolder[K, V]{
			key: key,
			value: val,
		}
	}
	o.store[key] = e
}

func (o *OrderedMap[K, V]) Get(key K) (V, bool) {
	val, exists := o.store[key]
	if !exists {
		return *new(V), false
	}
	return val.Value.(keyValueHolder[K, V]).value, true
}

func (o *OrderedMap[K, V]) Iterator() func() (*int, *K, V) {
	e := o.keys.Front()
	j := 0
	return func() (_ *int, _ *K, _ V) {
		if e == nil {
			return
		}

		keyVal := e.Value.(keyValueHolder[K, V])
		j++
		e = e.Next()

		return func() *int { v := j-1; return &v }(), &keyVal.key, keyVal.value
	}
}

In a nutshell, the difference here is that we are storing the doubly linked list node as the value in the map, and we store both the key and value as the value of the doubly linked list node through the keyValueHolder struct. This obviously impacts how we set and get the data, but the time complexity of both methods stays O(1). We will actually observe the biggest change with the Delete method here:

func (o *OrderedMap[K, V]) Delete(key K) {
	e, exists := o.store[key]
	if !exists {
		return
	}

	o.keys.Remove(e)

	delete(o.store, key)
}

If we were to break this down, here is what we are doing here:

Accessing the doubly linked list node from the map first, based on the given key. This is O(1) in terms of time complexity.
Calling Remove on the doubly linked list by passing the found element. This is O(1) in terms of time complexity.
Deleting node from the map, based on the given key. This is also O(1) in terms of time complexity.

One more thing to unwrap here is how come calling Remove on the doubly linked list is O(1) in terms of time complexity, and showing its implementation might shed some light on the rationale:

// Remove removes e from l if e is an element of list l.
// It returns the element value e.Value.
// The element must not be nil.
func (l *List) Remove(e *Element) interface{} {
	if e.list == l {
		// if e.list == l, l must have been initialized when e was inserted
		// in l or l == nil (e is a zero Element) and l.remove will crash
		l.remove(e)
	}
	return e.Value
}

// remove removes e from its list, decrements l.len, and returns e.
func (l *List) remove(e *Element) *Element {
	e.prev.next = e.next
	e.next.prev = e.prev
	e.next = nil // avoid memory leaks
	e.prev = nil // avoid memory leaks
	e.list = nil
	l.len--
	return e
}

All that this implementation does is to appropriately remove the links from the node that we want to remove, and wiring up its next node with its previous node (if they exist). You can also see this implementation here in Go source code.

You can also see the whole implementation working here in Go 2 Playground:

0 1 string1 is a string
1 2 string2 is a string
2 4 string4 is a string

Program exited.
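A note for anyone trying this on a released toolchain: the [type K comparable, V any] syntax used in this post comes from the generics draft that was available in the Go 2 playground at the time of writing. Generics later shipped in Go 1.18 with a slightly different syntax, where the type keyword inside the brackets is dropped. Below is a condensed sketch of the same idea in that released syntax, assuming Go 1.18 or later. The entry type and the Keys helper are names made up for this sketch and are not part of the implementation above.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is what each linked list node carries (hypothetical name).
type entry[K comparable, V any] struct {
	key   K
	value V
}

type OrderedMap[K comparable, V any] struct {
	store map[K]*list.Element
	keys  *list.List
}

func NewOrderedMap[K comparable, V any]() *OrderedMap[K, V] {
	return &OrderedMap[K, V]{store: map[K]*list.Element{}, keys: list.New()}
}

func (o *OrderedMap[K, V]) Set(key K, val V) {
	if e, ok := o.store[key]; ok {
		e.Value = entry[K, V]{key, val} // keep position, replace value
		return
	}
	o.store[key] = o.keys.PushBack(entry[K, V]{key, val})
}

// Delete is O(1): one map lookup, one list unlink, one map delete.
func (o *OrderedMap[K, V]) Delete(key K) {
	if e, ok := o.store[key]; ok {
		o.keys.Remove(e)
		delete(o.store, key)
	}
}

// Keys returns the keys in insertion order (helper added for this sketch).
func (o *OrderedMap[K, V]) Keys() []K {
	out := make([]K, 0, o.keys.Len())
	for e := o.keys.Front(); e != nil; e = e.Next() {
		out = append(out, e.Value.(entry[K, V]).key)
	}
	return out
}

func main() {
	m := NewOrderedMap[int, string]()
	for i := 1; i <= 4; i++ {
		m.Set(i, fmt.Sprintf("string%d", i))
	}
	m.Delete(3)
	fmt.Println(m.Keys()) // [1 2 4]
}
```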

Benchmarking to Show the Impact

To be able to understand the improvement we have made here, we can run a benchmark between the two Delete implementations with Go's built-in benchmark tooling. For this, we will go with the below benchmark setup:

Sequential input of an array with 1M items
Seeding the OrderedMap with these 1M items
Deleting a random key for 100K times

One caveat with the benchmark here is that we will run it with non-generic implementations since I wasn't able to find a way to install Go 2 on my machine to be able to run a benchmark, and I don't believe it's possible to run one with Go playground. That said, this shouldn't change anything for us to be able to understand the difference between the two Delete implementations, as the logic will stay the same. The only change will be that the usage of the type will not be obvious due to the lack of a strongly typed signature. I have put the OrderedMap with the original Delete implementation here, and with the improved one here.

The code for the benchmark itself is as below (which you can also find here):

package main

import (
	"fmt"
	"math/rand"
	"testing"
)

func BenchmarkOrderedMapLinkedListBasedDelete(b *testing.B) {
	for n := 0; n < b.N; n++ {
		seedCount := 1000000
		m := NewLinkedListBasedOrderedMap()
		for i := 1; i <= seedCount; i++ {
			m.Set(i, fmt.Sprintf("string%d", i))
		}
		for i := 0; i < 100000; i++ {
			m.Delete(rand.Intn(seedCount-1) + 1)
		}
	}
}

func BenchmarkOrderedMapSliceBasedDelete(b *testing.B) {
	for n := 0; n < b.N; n++ {
		seedCount := 1000000
		m := NewOrderedMap()
		for i := 1; i <= seedCount; i++ {
			m.Set(i, fmt.Sprintf("string%d", i))
		}
		for i := 0; i < 100000; i++ {
			m.Delete(rand.Intn(seedCount-1) + 1)
		}
	}
}

We can run this benchmark with the go test command. However, note that the exact time it takes to run each function is not significant here, since it will depend on the machine spec, etc. Also, in the test, we are seeding the map implementations for each run, which adds extra time to each measurement. That said, what's important to observe here is the difference in time between the two functions:

➜  ordered-map-generics git:(master) ✗ go test -bench=BenchmarkOrderedMap -benchtime=30s
goos: darwin
goarch: amd64
pkg: github.com/tugberkugurlu/algos-go/ordered-map-generics
BenchmarkOrderedMapLinkedListBasedDelete-4   	      42	 724271767 ns/op
BenchmarkOrderedMapSliceBasedDelete-4        	       1	54509739278 ns/op
PASS
ok  	github.com/tugberkugurlu/algos-go/ordered-map-generics	86.361s

Wow, we can see the orders of magnitude improvement here, and it's very rewarding to see the impact.

Conclusion

The biggest takeaway from this post, I believe, is how Generics will change the way we can implement more complex and powerful data structures, and allow them to be reused more effectively by truly taking advantage of the strongly-typed nature of Go, as well as how important it is to use the correct data structures for the expected usage of our own types. As a side note, I thoroughly enjoyed writing this post, as I was able to feel the power of having Generics available to use in Go!

Usage of the Heap Data Structure in Go (Golang), with Examples

Tugberk Ugurlu — Sun, 23 Aug 2020 13:39:00 +0000

So, Tell Me More About This Heap Data Structure

Heap is one of the most powerful data structures at our disposal to solve various real world problems more efficiently. The heap data structure usually comes in two shapes: Min Heap or Max Heap, and depending on which one it is, the heap will give you efficient (i.e. O(1)) access to the min/max value within the given collection.

Here are the characteristics of the heap data structure which, when combined, separate it from other data structures:

A tree-based data structure, which is a complete binary tree
In case of max heap, root node of the tree must represent the greatest value within the tree
In case of min heap, root node of the tree must represent the smallest value within the tree
Building a heap over an array of values by adding them one by one has the cost of O(n log n) in terms of time complexity (worst case), where n is the length of the original array. Heapifying the array in place, which is what heap.Init does in Go, brings this down to O(n)
Adding/removing a value from an existing heap has the cost of O(log n) in terms of time complexity, where n is the length of the heap

This information should be enough for us to get going for the purposes of this post, but if you want to understand a bit more on how to build a heap data structure, you can check this post out which shows some clever ways of building heap even with an array, instead of a tree.
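As a rough illustration of the array-based representation mentioned above: when a complete binary tree is laid out level by level in an array, the children of the node at index i sit at indices 2*i+1 and 2*i+2. The sketch below uses that layout to check the min heap property; the isMinHeap function name is made up for illustration.

```go
package main

import "fmt"

// isMinHeap reports whether a satisfies the min heap property when it is
// read as a complete binary tree stored level by level: the children of
// index i live at 2*i+1 and 2*i+2.
func isMinHeap(a []int) bool {
	for i := range a {
		left, right := 2*i+1, 2*i+2
		if left < len(a) && a[left] < a[i] {
			return false
		}
		if right < len(a) && a[right] < a[i] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isMinHeap([]int{1, 3, 2, 5, 4})) // true
	fmt.Println(isMinHeap([]int{3, 1, 2}))       // false: 1 is smaller than its parent 3
}
```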

heap Package in Go

Go is infamous for its lack of generics (which is hopefully changing soon), which makes it very hard to implement this type of collection. That said, Go provides a package called container/heap which has heap operations for any type that implements heap.Interface.

heap.Interface has the below signature:

type Interface interface {
	sort.Interface
	Push(x interface{}) // add x as element Len()
	Pop() interface{}   // remove and return element Len() - 1.
}

As we can see, it embeds the sort.Interface into its signature. So, let's also see what that interface signature looks like:

type Interface interface {
	// Len is the number of elements in the collection.
	Len() int
	// Less reports whether the element with
	// index i should sort before the element with index j.
	Less(i, j int) bool
	// Swap swaps the elements with indexes i and j.
	Swap(i, j int)
}

That's pretty much it. In a nutshell, Go asks us to implement some very basic operations on our own collection such as adding and removing a value, as well as requiring us to implement the sort interface, which needs us to report which of the two given values is less than the other and to swap two indices within the array. It also "kindly" asks us to perform some casting on behalf of it (ahem, covariance and contravariance, ahem, cough!).

There is still a catch here, since you can't add new methods to types defined outside of your package. For instance, the below code where we add methods to []int doesn't really work, with the error message of "Invalid receiver type '[]int' ('[]int' is an unnamed type)":

func (h []int) Push(x interface{}) {
	*h = append(*h, x.(int))
}

We can get around this with a type declaration and attaching the method on that type:

type IntHeap []int

func (h *IntHeap) Push(x interface{}) {
	*h = append(*h, x.(int))
}

You can swap the type int here with your own type, and build the heap structure for that type. For the purposes of this post though, we will continue with the IntHeap type which we have declared above. With that in mind, let's see what the final implementation looks like:

Tip 💡: if you are using Goland IDE, you can hit CMD + ENTER while your cursor is on the type, choose "Implement interface..." option, and then select the interface which you want to implement from the list, it will scaffold the structure of the interface for you:

type IntHeap []int

func (h IntHeap) Len() int {
	return len(h)
}

func (h IntHeap) Less(i, j int) bool {
	return h[i] < h[j]
}

func (h IntHeap) Swap(i, j int) {
	h[i], h[j] = h[j], h[i]
}

func (h *IntHeap) Push(x interface{}) {
	*h = append(*h, x.(int))
}

func (h *IntHeap) Pop() interface{} {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[0:n-1]
	return x
}

That's pretty much it. As you can see, all the implementation we had to do is for rudimentary operations, nothing fancy. We can now make use of this by initializing a variable with the given type: h := &IntHeap{}, and then start making use of the functions in the heap package. Below, you can see a very basic example where we build the heap from a set of values inside an array, and then start printing them (which is essentially the same as performing heap sort):

func main() {
	nums := []int{3,2,20,5,3,1,2,5,6,9,10,4}

	// initialize the heap data structure
	h := &IntHeap{}

	// add all the values to heap, O(n log n)
	for _, val := range nums { // O(n)
		heap.Push(h, val) // O(log n)
	}

	// print all the values from the heap
	// which should be in ascending order
	for i := 0; i < len(nums); i++ {
		fmt.Printf("%d,", heap.Pop(h).(int))
	}
}

The output is the values printed in ascending order, as you expect:

➜  git:(master) ✗ go run main.go
1,2,2,3,3,4,5,5,6,9,10,20,%
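Pushing values one at a time is not the only way to build the heap. The container/heap package also has heap.Init, which establishes the heap ordering over an existing collection and is documented as O(n). The sketch below assumes the same IntHeap type as above (repeated so that the snippet is self-contained); the sortedWithHeap function name is made up for illustration.

```go
package main

import (
	"container/heap"
	"fmt"
)

type IntHeap []int

func (h IntHeap) Len() int            { return len(h) }
func (h IntHeap) Less(i, j int) bool  { return h[i] < h[j] }
func (h IntHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *IntHeap) Push(x interface{}) { *h = append(*h, x.(int)) }
func (h *IntHeap) Pop() interface{} {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// sortedWithHeap copies nums, heapifies the copy in place and pops
// every value, which yields the values in ascending order.
func sortedWithHeap(nums []int) []int {
	h := make(IntHeap, len(nums))
	copy(h, nums)
	heap.Init(&h) // O(n), instead of n pushes at O(log n) each

	out := make([]int, 0, len(nums))
	for h.Len() > 0 {
		out = append(out, heap.Pop(&h).(int))
	}
	return out
}

func main() {
	fmt.Println(sortedWithHeap([]int{3, 2, 20, 5, 3, 1, 2, 5, 6, 9, 10, 4}))
	// [1 2 2 3 3 4 5 5 6 9 10 20]
}
```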

A Practical Application of Heap Data Structure

That's great, but what are the real world applications of this data structure? There are a few, and I want to focus on one real world example today: finding the k best elements within an unsorted array of length n, where "the best" means the largest values in the array. Seems straightforward! If we want to solve this problem with minimum effort, the easiest way is to sort the array and return the first k elements from it. The code for this would look like the below:

func FindBestKElementsWithSort(nums []int, k int) []int {
	sort.Slice(nums, func(i, j int) bool { // O (n log n)
		return nums[i] > nums[j]
	})

	return func() []int { // O (k)
		result := make([]int, k)
		for i := 0; i < k; i++ {
			result[i] = nums[i]
		}
		return result
	}()
}

We can also test it with the below code, to ensure that the logic works as expected:

package main

import (
	"fmt"
	"reflect"
	"testing"
)

var bestElementsTestdata = []struct {
	in  []int
	k   int
	f   func(nums []int, k int) []int
	out []int
}{
	{[]int{3, 2, 1, 5, 6, 4}, 2, FindBestKElementsWithSort, []int{6,5}},
	{[]int{3, 2, 3, 1, 2, 4, 5, 5, 6}, 4, FindBestKElementsWithSort, []int{6,5,5,4}},
}

func TestBestElementsLogic(t *testing.T) {
	for _, tt := range bestElementsTestdata {
		t.Run(fmt.Sprintf("%v", tt.in), func(t *testing.T) {
			out := tt.f(tt.in, tt.k)
			// slices cannot be compared with !=, so compare them element by element
			if !reflect.DeepEqual(out, tt.out) {
				t.Errorf("got %v, want %v", out, tt.out)
			}
		})
	}
}

The time complexity of this is going to be O(n log n + k), which is not bad. However, we can do better with the assumption that k will be smaller than n here. In a real world case where we want to, for example, find the top 100 results within a result set of millions, this assumption will be the key part of our optimization.

With that in mind, instead of directly sorting the array, we can maintain a min heap with a max length of k, and once we have iterated over the entire given list, we can reverse the result from our min heap. The code for this will look like the below:

func FindBestKElements(nums []int, k int) []int {
	h := &IntHeap{}
	for _, val := range nums { // O(N)
		heap.Push(h, val) // O(log K)
		if h.Len() > k {
			heap.Pop(h) // O(log K)
		}
	}

	return func() []int { // O (k log k)
		result := make([]int, h.Len())
		initialLen := h.Len()
		for i := initialLen; i > 0; i-- {
			result[i-1] = heap.Pop(h).(int)
		}
		return result
	}()
}

We can now extend the original test cases to make sure that our logic works as expected:

package main

import (
	"fmt"
	"reflect"
	"testing"
)

var bestElementsTestdata = []struct {
	in  []int
	k   int
	f   func(nums []int, k int) []int
	out []int
}{
	{[]int{3, 2, 1, 5, 6, 4}, 2, FindBestKElements, []int{6,5}},
	{[]int{3, 2, 3, 1, 2, 4, 5, 5, 6}, 4, FindBestKElements, []int{6,5,5,4}},
	{[]int{3, 2, 1, 5, 6, 4}, 2, FindBestKElementsWithSort, []int{6,5}},
	{[]int{3, 2, 3, 1, 2, 4, 5, 5, 6}, 4, FindBestKElementsWithSort, []int{6,5,5,4}},
}

func TestBestElementsLogic(t *testing.T) {
	for _, tt := range bestElementsTestdata {
		t.Run(fmt.Sprintf("%v", tt.in), func(t *testing.T) {
			out := tt.f(tt.in, tt.k)
			// slices cannot be compared with !=, so compare them element by element
			if !reflect.DeepEqual(out, tt.out) {
				t.Errorf("got %v, want %v", out, tt.out)
			}
		})
	}
}

Result:

➜  git:(master) ✗ go test --run=TestBestElementsLogic -v
=== RUN   TestBestElementsLogic
=== RUN   TestBestElementsLogic/[3_2_1_5_6_4]
=== RUN   TestBestElementsLogic/[3_2_3_1_2_4_5_5_6]
=== RUN   TestBestElementsLogic/[3_2_1_5_6_4]#01
=== RUN   TestBestElementsLogic/[3_2_3_1_2_4_5_5_6]#01
--- PASS: TestBestElementsLogic (0.00s)
    --- PASS: TestBestElementsLogic/[3_2_1_5_6_4] (0.00s)
    --- PASS: TestBestElementsLogic/[3_2_3_1_2_4_5_5_6] (0.00s)
    --- PASS: TestBestElementsLogic/[3_2_1_5_6_4]#01 (0.00s)
    --- PASS: TestBestElementsLogic/[3_2_3_1_2_4_5_5_6]#01 (0.00s)
PASS
ok  	_/Users/tugberkugurlu/go/src/github.com/tugberkugurlu/algos-go/kth-largest	0.835s

The time complexity of this is O(n log k + k log k), which is much better.
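To get a feel for what the two formulas mean in practice, the sketch below plugs illustrative sizes into them (10 million values and k = 500, picked only as an example). The sortCost and heapCost names are made up, constant factors are ignored, and the numbers are operation-count estimates rather than predictions of run time.

```go
package main

import (
	"fmt"
	"math"
)

// sortCost and heapCost plug n and k into the two complexity formulas.
// Constant factors are ignored, so these are not predictions of run time.
func sortCost(n, k float64) float64 { return n*math.Log2(n) + k }
func heapCost(n, k float64) float64 { return n*math.Log2(k) + k*math.Log2(k) }

func main() {
	n, k := 1e7, 500.0 // illustrative sizes
	fmt.Printf("sort based: %.2e\n", sortCost(n, k))
	fmt.Printf("heap based: %.2e\n", heapCost(n, k))
	fmt.Printf("ratio:      %.1f\n", sortCost(n, k)/heapCost(n, k))
}
```

The advantage comes entirely from k being much smaller than n: as k approaches n, log k approaches log n and the heap based formula stops being the smaller one.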

Benchmarking

But, how much better? To be able to understand the improvement we have made here, we can run a benchmark between the two implementations with Go's built-in benchmark tooling. For this, we will go with the below benchmark setup:

Random input of an array with 10M items
We will use the same set of input across all the runs to be able to make the comparison fair. We will achieve this by using the TestMain hook in Go.
Value of k as 500
For each run, we will make a copy of the array so that we can run the benchmark deterministically since the sort based solution mutates the given array, and we will also do this for the heap based solution to make the comparison fair

The code for this benchmark is as below:

package main

import (
	"math/rand"
	"testing"
)

var nums []int
func TestMain(m *testing.M) {
	maxVal := 10000000
	nums = make([]int, maxVal)
	for i := 0; i < len(nums); i++ {
		nums[i] = rand.Intn(maxVal)
	}
	m.Run()
}

func BenchmarkFindBestKElementsK500(b *testing.B) {
	k := 500
	for n := 0; n < b.N; n++ {
		nums2 := make([]int, len(nums))
		for i, v := range nums {
			nums2[i] = v
		}
		FindBestKElements(nums2, k)
	}
}

func BenchmarkFindBestKElementsWithSortK500(b *testing.B) {
	k := 500
	for n := 0; n < b.N; n++ {
		nums2 := make([]int, len(nums))
		for i, v := range nums {
			nums2[i] = v
		}
		FindBestKElementsWithSort(nums2, k)
	}
}

We can run this benchmark with the go test command. However, note that the exact time it takes to run each function is not significant here, since it will depend on the machine spec, etc. Also, in the test, we are copying the input array for each run, which adds extra time to each measurement. That said, what's important to observe here is the difference in time between the two functions:

➜  git:(master) ✗ go test -bench=BenchmarkFindBestKElements -benchtime=30s 
goos: darwin
goarch: amd64
BenchmarkFindBestKElementsK500-4           	      15	2200446151 ns/op
BenchmarkFindBestKElementsWithSortK500-4   	      13	2618637516 ns/op
PASS
ok  	_/Users/tugberkugurlu/go/src/github.com/tugberkugurlu/algos-go/kth-largest	72.740s

The heap based implementation is about 20% faster than the sort based implementation, which is a significant performance difference. More importantly, this gap will widen as the length of the array increases.

Conclusion

Heap is a powerful data structure, which is perfect for solving some real world problems efficiently. However, it's often overlooked. Hopefully, this post sheds some light on where this data structure can be useful for us in terms of efficiency, and how the Go programming language helps us by providing the necessary groundwork to work with this data structure, even if it's still not at the desirable level in terms of reusability due to the lack of generics in the platform (which means that I used all my daily allowance for ranting about the lack of generics in Go).

Kafka Core Concepts and Producer Semantics

Tugberk Ugurlu — Tue, 26 May 2020 00:30:00 +0000

Being able to pass data around within a distributed system is one of the most crucial aspects of the success of your business, especially when you are dealing with a large number of users, reads and writes. It's usual that for a given data write for an entity, you will have N read patterns, not just one. Apache Kafka is one of the most effective ways to enable that data distribution within a complex system. I have had the chance to use Kafka at work for more than a year now. However, it has always been implicit and I never needed to understand its intrinsic semantics (standing on the shoulders of giants). I have spent this extended weekend reading the Kafka documentation and running some local examples with Kafka to understand it in detail, not just at a high level.

Kafka already has great documentation, which is very detailed and clear. The intention with this post is not to replicate that document. Instead, it's to pull out bits and pieces which helped me understand Kafka better, and increased my trust. As it has been said in the Batman Begins movie (which is one of my all-time favourites): "You always fear what you don't understand", and the main outcome here is to remove that fear :) The post is written by someone (me) who has previous experience with messaging systems such as RabbitMQ, Amazon SQS, and Azure Service Bus. So, I might be overlooking some important aspects which you may also need if you don't have this background. If that's the case, it might be useful to first understand some use cases where Kafka might fit in.

Concepts

Let's first understand some of the high level concepts of Kafka, which will allow us to get started and work on a sample later on. This is by no means an exhaustive list of concepts in Kafka but will be enough to get us going by allowing us to extract some facts as well as allowing us to make some assumptions with high confidence.

The most important concept in Kafka is the Topic. A topic in Kafka is a place into which you can logically group your messages. When I say logically, I don't mean a schema or anything. You can think of it as just a bucket where your data will end up in the order it appears, and from which it can also be retrieved in the same order (i.e. continually appended to a structured commit log). Topics can be subscribed to by one or more consumers, which we will touch on a bit later. This means that Kafka doesn't have exact message queue semantics, where the data is gone as soon as one consumer processes it.

These messages are called Records, which are durably persisted in the Kafka cluster regardless of whether they have been consumed or not. This differentiates Kafka from queuing systems such as RabbitMQ or SQS, where messages vanish after they have been consumed and processed. Using Kafka for storing records permanently is a perfectly valid choice. However, if this is not desired, Kafka also gives you retention configuration options to specify how long you want to hold onto records on a per-topic basis.

Records get into (i.e. are written to) a topic through a producer, which is also responsible for choosing which record to assign to which partition within the topic. In other words, data sharding is handled by the clients which publish data to a particular topic. Depending on what client you use, you may have different options on how to distribute data across the partitions, e.g. round robin, your custom sharding strategy, etc.

The records within a specific topic are consumed (i.e. read) by a consumer, which is part of a consumer group. Consumer groups allow records to be processed in parallel by the consumer instances (which are associated to that group, and can live in separate processes or machines) with a guarantee that a record is only delivered to one consumer instance. A consumer instance within a consumer group will own one or more partitions exclusively, which means that you can have at most N active consumers in a group if you have N partitions.
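The relationship between partitions and consumers in a group can be pictured with a small toy model. The sketch below spreads partition numbers over consumers round-robin; it only illustrates the exclusive ownership idea and is not the assignment strategy of any Kafka client. The assign function name is made up.

```go
package main

import "fmt"

// assign spreads partition numbers over consumers round-robin and returns,
// for every consumer index, the partitions it would own. It is a toy
// model, not the assignment strategy of any Kafka client.
// consumers must be at least 1.
func assign(partitions, consumers int) [][]int {
	owned := make([][]int, consumers)
	for p := 0; p < partitions; p++ {
		c := p % consumers
		owned[c] = append(owned[c], p)
	}
	return owned
}

func main() {
	fmt.Println(assign(4, 2)) // [[0 2] [1 3]]
	fmt.Println(assign(2, 3)) // [[0] [1] []]: the third consumer sits idle
}
```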

So, based on these, here are some take aways which I was able to further unpack by following up:

Data stored in Kafka is immutable, meaning that it cannot be updated. In other words, Kafka is working with an append-only data structure and all you can do with it is to ask for the next record and reset the current pointer.

Kafka has a distributed nature to cater your scalability and high availability needs.

Kafka guarantees ordering for the records, but only on a per-partition basis, and how you retry messages can also have an impact on this order. Therefore, it's safest to assume at the consumption level that Kafka won't give you a message ordering guarantee, and you may need to understand the details of this further depending on how your messages are distributed across the partitions, and how you plan to process that data.

Kafka is consumer driven, which means that the consumer is in charge of deciding which position to read the data from. In practical terms, this means that the consumer can reset the offset and start from wherever it wants to. Check out the Offset Tracking and Consumer Group Management sections for more info on this.

It's possible to add new nodes to your cluster. The data distribution to this node though needs to be triggered manually.

Related to the above, you can increase the number of partitions for a given topic. However, this is an operation you do not want to perform without proactively thinking through the consequences, since the way you publish data to Kafka might be impacted by this if, for example, your sharding strategy relies on knowing the partition count (i.e. hash(key) % number_of_partitions). It's also important to know that Kafka will not attempt to automatically redistribute data in any way. So, this onus is on you, too.

There is currently no support for reducing the number of partitions for a topic.
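To see why changing the partition count matters for a hash(key) % number_of_partitions strategy, here is a minimal sketch that counts how many keys would land on a different partition after going from 3 to 4 partitions. FNV-1a is used purely for illustration; Kafka clients use their own hash functions, so actual partition numbers will differ. The partitionFor and movedKeys names and the user-N key format are made up.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor maps a key to a partition with hash(key) % numPartitions.
// FNV-1a is used only for illustration; Kafka clients ship their own
// hash functions, so real partition numbers will differ.
func partitionFor(key string, numPartitions int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(numPartitions))
}

// movedKeys counts how many of the keys user-0 .. user-(n-1) land on a
// different partition once the partition count changes.
func movedKeys(n, before, after int) int {
	moved := 0
	for i := 0; i < n; i++ {
		key := fmt.Sprintf("user-%d", i)
		if partitionFor(key, before) != partitionFor(key, after) {
			moved++
		}
	}
	return moved
}

func main() {
	fmt.Println(movedKeys(1000, 3, 4), "of 1000 keys changed partition")
}
```

Any key that moves will have its older records in one partition and its newer records in another, which is where the per-partition ordering guarantee stops helping.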

Semantics of Data Producing

On the data producing side, we need to know the topic name and the approach we need to use to distribute data across partitions (your client will likely help with this through some out-of-the-box strategies, such as round-robin as guaranteed by Confluent clients). Apart from this, we have quite a few producer level configuration settings we can apply to influence the semantics of data publishing.

When I am working with messaging systems, the first thing I want to understand is how the message delivery and durability guarantees are influenced, and what the default behaviour is for these. In Kafka, I found this story a bit more confusing than it should probably be, which is due to a few configuration settings needing to be aligned to make it work in favour of durability and prevent message loss. Here are some important configuration settings for this:

acks: This setting indicates the number of acknowledgments the producer requires for a message publishing to be deemed as successful. It can be set to 0, meaning that the producer won't require an ack from any of the servers and this won't give us any guarantees that the message is received by the server. This option could be preferable for cases where we need high throughput at the producing side and the data loss is not critical (e.g. sensor data, where losing a few seconds of data from a source won't spoil our world). For cases where record durability is important, this can be set to all. This means the leader will wait for the full set of in-sync replicas to acknowledge the record, where the minimum number of required in-sync replicas is configured separately.

min.insync.replicas: Quoting from the doc directly: "When a producer sets acks to "all" (or "-1"), min.insync.replicas specifies the minimum number of replicas that must acknowledge a write for the write to be considered successful". This is a topic-level setting but can also be specified at the broker level. Setting this to the correct value is really important. It's set to 1 by default, which is probably not what you want if you care about the durability of your messages and you have a replication factor of 3 or more for the topic.
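The interplay between acks and min.insync.replicas can be sketched as a tiny decision function. This is a simplified model written for this post rather than Kafka code: the function name and its parameters are hypothetical, and the in-sync replica count is assumed to include the leader.

```csharp
using System;

public static class AckDemo
{
    // Simplified model: returns true when the write would be accepted.
    // acks is "0", "1" or "all"; inSyncReplicas includes the leader.
    public static bool IsWriteAccepted(string acks, int inSyncReplicas, int minInSyncReplicas)
    {
        if (acks != "all")
        {
            // min.insync.replicas is only enforced when acks is set to all,
            // so a reachable leader is enough here.
            return inSyncReplicas >= 1;
        }

        return inSyncReplicas >= minInSyncReplicas;
    }

    public static void Main()
    {
        // Replication factor 3, but both followers have fallen out of sync.
        Console.WriteLine(IsWriteAccepted("all", 1, 1)); // True: acknowledged with a single copy
        Console.WriteLine(IsWriteAccepted("all", 1, 2)); // False: rejected instead of risking loss
        Console.WriteLine(IsWriteAccepted("1", 1, 2));   // True: the setting is not consulted
    }
}
```

The first case is the one to watch out for: with the default value of 1, acks=all can still acknowledge a write that exists on the leader only.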

flush.messages: In Kafka, messages are immediately written to the filesystem, but by default the OS cache is only synced to disk with fsync() lazily. This means that even if we have set acks and min.insync.replicas to optimise for durability, there is still a theoretical chance of losing data. I explicitly said "theoretical" here as it's quite unlikely to lose data when the settings are appropriate for relying on replication for durability. For instance, with acks=all and min.insync.replicas=2 for a topic which has a replication factor of 3, we would only lose data after seeing a write reported as successful if all 3 machines (1 leader and 2 followers) failed at the same time, before any of them had a chance to flush that particular record to disk. That is pretty unlikely, and this is why Kafka doesn't recommend setting this value or the flush.ms value. So, we need to think a bit harder before setting these configuration values, as there are some trade-offs to consider:
- Durability: Unflushed data may be lost if you are not using replication.
- Latency: Very large flush intervals may lead to latency spikes when the flush does occur as there will be a lot of data to flush.
- Throughput: The flush is generally the most expensive operation, and a small flush interval may lead to excessive seeks.

So, there is a lot to think about here just to get message durability right. The good side of this complexity is that Kafka is not trying to provide one way to solve all problems, which is not really possible, especially when you want to optimise for different aspects (e.g. durability, throughput, etc.) depending on the problem at hand. There is some further information on message delivery guarantees in the Kafka documentation.

There are some other producer semantics that require understanding, since the consequences of not understanding them might be costly depending on your needs. For example, producer retries are really important to understand correctly, as they can have an impact on message ordering even within a single partition. Another one is the batch size configuration, which influences how many records are batched into one request whenever multiple records are being sent to the same partition. This might mean that the sends will be performed asynchronously, which may not be suitable for your needs. Finally, log compaction is another concept which can be really useful to know about beforehand, especially for cases where you publish the current state of an entity to a topic instead of publishing fine-grained events.
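As an illustration of why log compaction suits the "current state of an entity" case, here is a minimal sketch of the end result it aims for: only the latest record per key is retained. This is a simplification under my own assumptions. Real compaction runs in the background, never touches the newest part of the log and also handles deletions, none of which is modelled here. The keys and values are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CompactionDemo
{
    // Keeps only the latest record for each key, preserving the relative
    // order of the records that survive.
    public static List<KeyValuePair<string, string>> Compact(IEnumerable<KeyValuePair<string, string>> log)
    {
        var records = log.ToList();
        var lastIndexByKey = new Dictionary<string, int>();
        for (int i = 0; i < records.Count; i++)
        {
            lastIndexByKey[records[i].Key] = i;
        }

        return records.Where((record, index) => lastIndexByKey[record.Key] == index).ToList();
    }

    public static void Main()
    {
        var log = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("car-1", "price=100"),
            new KeyValuePair<string, string>("car-2", "price=200"),
            new KeyValuePair<string, string>("car-1", "price=150")
        };

        foreach (var record in Compact(log))
        {
            Console.WriteLine(record.Key + " -> " + record.Value);
        }
        // prints:
        // car-2 -> price=200
        // car-1 -> price=150
    }
}
```

A consumer that starts from the beginning of such a topic can rebuild the latest state per key without replaying every intermediate update.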

Resources

Kafka Acks Explained

Why fsync is bad for Kafka

Does kafka send the acks response to the producer after flush the messages to the disk or just keep them in the memory

Can a message loss occur in Kafka even if producer gets acknowledgement for it?

Distributed Caching in .NET Core with PostSharp and Redis

Tugberk Ugurlu — Wed, 03 Jul 2019 21:48:34 +0000

In my previous post, I walked through the benefits of using PostSharp for caching in a .NET Core server application. However, the example I showed there would only work for a single-node application and, as we know, probably no application today runs on a single node. Deploying to multiple nodes has several benefits, such as further fault tolerance and load distribution.

Luckily for us, the PostSharp caching backend is modular and the default in-memory one I used in my previous post can be swapped. One of the out-of-the-box implementations is based on Redis, which is a highly scalable, distributed data structure server. One of the most common use cases of Redis is as an ephemeral key/value store to power the caching needs of apps.

Run Redis Locally

The best way to run Redis locally is through Docker. Let’s run the below command to do this:

docker run --name postsharp-redis -p 6379:6379 -d redis
a30f1c1e991e0159fb5f96dfb053f50c50726101907c7f76d319d5e987a6cf3a

We have just got a Redis instance up and running in our local environment and exposed it through TCP port mapping so that it's available on the host machine at port 6379. The final thing we need to do to get this ready for PostSharp usage is to set up key-space notifications to include the AKE events (Redis controls this through its notify-keyspace-events setting). You can see the Redis notifications document for details on this.

Configure for Redis Cache

First thing to do is to install the NuGet package which contains the Redis caching backend implementation for PostSharp.

dotnet add package PostSharp.Patterns.Caching.Redis --version 6.2.8

Then, all we need to do is to change the caching backend to be the Redis implementation, which we have configured inside our Program.Main method in the previous post:

string connectionConfiguration = "127.0.0.1";
var connection = ConnectionMultiplexer.Connect(connectionConfiguration);
var redisCachingConfiguration = new RedisCachingBackendConfiguration();
CachingServices.DefaultBackend = RedisCachingBackend.Create(connection, redisCachingConfiguration);

Notice the server address we have entered, that points to the Redis instance we have got up and running through Docker and exposed to the host through port mapping. As we used the default Redis port, we didn’t need to state it explicitly.

From this point forward, our app is all ready to run with Redis caching enabled, without a single line of code change in the app components. The only change we had to make was on the configuration side.

For production, it’s worth getting a hold of the Redis server address through a configuration system such as the one provided with ASP.NET Core so that you can swap it based on your environment.
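As a rough sketch of that idea, the address could be resolved at startup instead of being hard-coded. To keep the example self-contained, it reads an environment variable rather than using the ASP.NET Core configuration system, and the variable name REDIS_CONNECTION is a hypothetical one picked for this example.

```csharp
using System;

public static class RedisSettings
{
    // Hypothetical variable name chosen for this sketch.
    public const string VariableName = "REDIS_CONNECTION";

    // Returns the configured address, or the fallback when nothing is set.
    public static string ResolveConnection(string fallback)
    {
        var value = Environment.GetEnvironmentVariable(VariableName);
        return string.IsNullOrWhiteSpace(value) ? fallback : value;
    }

    public static void Main()
    {
        // Falls back to the local Docker instance used earlier in this post.
        string connectionConfiguration = ResolveConnection("127.0.0.1");
        Console.WriteLine("Redis connection: " + connectionConfiguration);

        // The resolved value would then be handed to ConnectionMultiplexer.Connect
        // in place of the hard-coded string from the snippet above.
    }
}
```

With this in place, each environment only needs to set the variable, and local development keeps working without any extra setup.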

Declarative Coding Approach to Caching in .NET Core with PostSharp

Tugberk Ugurlu — Sat, 04 May 2019 11:14:56 +0000

One of the first criteria of effective code is that it does its job with as few lines of code as possible. Effective code does not repeat itself. Less code in our codebases increases our chances of having fewer bugs. So, how do we avoid repeating ourselves? We apply our intelligence and abstraction skills to generalize behaviors into methods and classes, the constructs offered by C# to implement abstraction, which we call encapsulation. However, some features such as logging or caching cannot be properly encapsulated into a class or method. That’s why you end up having code repetition. C# alone is simply not able to properly encapsulate features like logging, caching, security, INotifyPropertyChanged, undo/redo, etc.

I have been meaning to look into aspect-oriented programming for a while to help make my code less noisy without sacrificing the application's performance and observability. This would help cut right to the business logic, allowing me to care about what's more important. When the topic is aspect-oriented programming, the first software that comes to my mind in the .NET world is obviously PostSharp. In this post, I will be looking at how PostSharp can help us cut the noise out of our code, and showcase this with a sample on data caching.

Getting Started with PostSharp

First of all, let's create our project structure and install PostSharp. I have .NET Core SDK 2.2.202 installed and ran the below commands to create the empty project structure.

dotnet new web --no-https
dotnet new sln
dotnet sln 1-sample-web.sln add 1-sample-web.csproj
dotnet new globaljson

In order to give you an idea about the value proposition of PostSharp, I created this little ASP.NET Core sample which exposes HTTP APIs to read, write and modify the cars in our system. Some of the code here is contrived, such as sleeping for half a second, but we will see why this is useful for seeing PostSharp in action.

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.ComponentModel.DataAnnotations;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Extensions.DependencyInjection;

namespace _1_sample_web
{
    public class Startup
    {
        public void ConfigureServices(IServiceCollection services)
        {
            services.AddMvc();
        }

        public void Configure(IApplicationBuilder app, IHostingEnvironment env)
        {
            app.UseMvcWithDefaultRoute();
        }
    }

    public class CarsController : Controller
    {
        private static readonly CarsContext _carsCtx = new CarsContext();

        [HttpGet("cars")]
        public IEnumerable<Car> Get()
        {
            return _carsCtx.GetAll();
        }

        [HttpGet("cars/{id}")]
        public IActionResult GetCar(int id) 
        {
            var carTuple = _carsCtx.GetSingle(id);
            if (!carTuple.Item1) 
            {
                return NotFound();
            }

            return Ok(carTuple.Item2);
        }

        [HttpPost("cars")]
        public IActionResult PostCar(Car car) 
        {
            var createdCar = _carsCtx.Add(car);
            return CreatedAtAction(nameof(GetCar), 
                new { id = createdCar.Id }, 
                createdCar);
        }

        [HttpPut("cars/{id}")]
        public IActionResult PutCar(int id, Car car) 
        {
            car.Id = id;
            if (!_carsCtx.TryUpdate(car)) 
            {
                return NotFound();
            }

            return Ok(car);
        }

        [HttpDelete("cars/{id}")]
        public IActionResult DeleteCar(int id) 
        {
            if (!_carsCtx.TryRemove(id)) 
            {
                return NotFound();
            }

            return NoContent();
        }
    }

    public class Car 
    {
        public int Id { get; set; }

        [Required]
        [StringLength(20)]
        public string Make { get; set; }

        [Required]
        [StringLength(20)]
        public string Model { get; set; }

        public int Year { get; set; }

        [Range(0, 500000)]
        public float Price { get; set; }
    }

    public class CarsContext
    {
        private int _nextId = 9;
        private object _idLock = new object();

        private readonly ConcurrentDictionary<int, Car> _database = new ConcurrentDictionary<int, Car>(new HashSet<KeyValuePair<int, Car>> 
        { 
            new KeyValuePair<int, Car>(1, new Car { Id = 1, Make = "Make1", Model = "Model1", Year = 2010, Price = 10732.2F }),
            new KeyValuePair<int, Car>(2, new Car { Id = 2, Make = "Make2", Model = "Model2", Year = 2008, Price = 27233.1F }),
            new KeyValuePair<int, Car>(3, new Car { Id = 3, Make = "Make3", Model = "Model1", Year = 2009, Price = 67437.0F }),
            new KeyValuePair<int, Car>(4, new Car { Id = 4, Make = "Make4", Model = "Model3", Year = 2007, Price = 78984.2F }),
            new KeyValuePair<int, Car>(5, new Car { Id = 5, Make = "Make5", Model = "Model1", Year = 1987, Price = 56200.89F }),
            new KeyValuePair<int, Car>(6, new Car { Id = 6, Make = "Make6", Model = "Model4", Year = 1997, Price = 46003.2F }),
            new KeyValuePair<int, Car>(7, new Car { Id = 7, Make = "Make7", Model = "Model5", Year = 2001, Price = 78355.92F }),
            new KeyValuePair<int, Car>(8, new Car { Id = 8, Make = "Make8", Model = "Model1", Year = 2011, Price = 1823223.23F })
        });
        
        public IEnumerable<Car> GetAll()
        {
            Thread.Sleep(500);
            return _database.Values;
        }

        public IEnumerable<Car> Get(Func<Car, bool> predicate) 
        {
            Thread.Sleep(500);
            return _database.Values.Where(predicate);
        }

        public Tuple<bool, Car> GetSingle(int id) 
        {
            Thread.Sleep(500);

            Car car;
            var doesExist = _database.TryGetValue(id, out car);
            return new Tuple<bool, Car>(doesExist, car);
        }

        public Car GetSingle(Func<Car, bool> predicate) 
        {
            Thread.Sleep(500);
            return _database.Values.FirstOrDefault(predicate);
        }

        public Car Add(Car car) 
        {
            Thread.Sleep(500);
            lock(_idLock) 
            {
                car.Id = _nextId;
                _database.TryAdd(car.Id, car);
                _nextId++;
            }

            return car;
        }

        public bool TryRemove(int id) 
        {
            Thread.Sleep(500);

            Car removedCar;
            return _database.TryRemove(id, out removedCar);
        }

        public bool TryUpdate(Car car) 
        {
            Thread.Sleep(500);

            Car oldCar;
            if (_database.TryGetValue(car.Id, out oldCar)) {

                return _database.TryUpdate(car.Id, car, oldCar);
            }

            return false;
        }
    }
}

Before going further, let's install PostSharp through NuGet. The first thing you want to install is the PostSharp NuGet package, which magically hooks into the compilation step thanks to its custom MSBuild scripts. The other package here will be PostSharp.Patterns.Diagnostics, as I want to show you a logging example first.

dotnet add package PostSharp
dotnet add package PostSharp.Patterns.Diagnostics

Let's get the sample code from the logging documentation.

using PostSharp.Patterns.Diagnostics;
using PostSharp.Extensibility;

[assembly: Log(AttributePriority = 1, AttributeTargetMemberAttributes = MulticastAttributes.Protected | MulticastAttributes.Internal | MulticastAttributes.Public)]
[assembly: Log(AttributePriority = 2, AttributeExclude = true, AttributeTargetMembers = "get_*" )]

When you run the application now, you will be impressed, and probably also blown away, by how much value and observability you get with very little work!

PostSharp Caching Example

The main reason for me to explore PostSharp is caching, and this is where PostSharp Caching really shines. Let's run our sample application again and perform a mini load test on it.

1..10 | foreach {write-host "$([Math]::Round((Measure-Command -Expression { Invoke-WebRequest -Uri http://localhost:5000/cars }).TotalMilliseconds, 1))"}

You will notice that each call to the "/cars" endpoint takes more than 500ms, which is fair due to us sleeping that amount of time on purpose. However, this could well be the case when you connect to a data store in a real world example. Even if your data store is performant and gets the result instantly, we are still wasting resources here because the data hasn't changed and we would be doing an unnecessary trip to the database to get the data which we have already retrieved previously.

Caching is the solution to this problem. However, it's not really easy to get right on your own in a web application, which is multithreaded by nature. You can use built-in APIs such as the ones that come with ASP.NET Core, but you then need to express your caching requirements in code in a verbose way. This makes it hard to understand the business logic behind a cluttered codebase and, suddenly, you will be struggling to add or modify functionality in existing software.

Let's see how PostSharp can help us here. First, we need to add the caching support by installing the PostSharp.Patterns.Caching NuGet package.

dotnet add package PostSharp.Patterns.Caching

Then, we need to make some changes to our code to enable caching. Here is the git patch which shows you what exactly I have changed:

From a20fc8e95ffd9bf5d424467e0e1283ae5891454a Mon Sep 17 00:00:00 2001
From: Tugberk Ugurlu
Date: Tue, 9 Apr 2019 23:38:32 +0100
Subject: [PATCH] add caching

 postsharp/0-caching/1-sample-web/1-sample-web.csproj | 1 +
 postsharp/0-caching/1-sample-web/Program.cs          | 3 +++
 postsharp/0-caching/1-sample-web/Startup.cs          | 4 +++-
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/postsharp/0-caching/1-sample-web/1-sample-web.csproj b/postsharp/0-caching/1-sample-web/1-sample-web.csproj
index bd55b6c..008c486 100644
--- a/postsharp/0-caching/1-sample-web/1-sample-web.csproj
+++ b/postsharp/0-caching/1-sample-web/1-sample-web.csproj
@@ -10,6 +10,7 @@
     
     
     
+    
     
   
 
diff --git a/postsharp/0-caching/1-sample-web/Program.cs b/postsharp/0-caching/1-sample-web/Program.cs
index 3dcae2c..9d241eb 100644
--- a/postsharp/0-caching/1-sample-web/Program.cs
+++ b/postsharp/0-caching/1-sample-web/Program.cs
@@ -7,6 +7,8 @@ using Microsoft.AspNetCore;
 using Microsoft.AspNetCore.Hosting;
 using Microsoft.Extensions.Configuration;
 using Microsoft.Extensions.Logging;
+using PostSharp.Patterns.Caching;
+using PostSharp.Patterns.Caching.Backends;
 using PostSharp.Patterns.Diagnostics;
 using PostSharp.Patterns.Diagnostics.Backends.Console;
 
@@ -18,6 +20,7 @@ namespace _1_sample_web
         public static void Main(string[] args)
         {
             LoggingServices.DefaultBackend = new ConsoleLoggingBackend();
+            CachingServices.DefaultBackend = new MemoryCachingBackend();
             CreateWebHostBuilder(args).Build().Run();
         }
 
diff --git a/postsharp/0-caching/1-sample-web/Startup.cs b/postsharp/0-caching/1-sample-web/Startup.cs
index 18b3dbc..bed37ca 100644
--- a/postsharp/0-caching/1-sample-web/Startup.cs
+++ b/postsharp/0-caching/1-sample-web/Startup.cs
@@ -10,6 +10,7 @@ using Microsoft.AspNetCore.Hosting;
 using Microsoft.AspNetCore.Http;
 using Microsoft.AspNetCore.Mvc;
 using Microsoft.Extensions.DependencyInjection;
+using PostSharp.Patterns.Caching;
 
 namespace _1_sample_web
 {
@@ -115,7 +116,8 @@ namespace _1_sample_web
             new KeyValuePair<int, Car>(7, new Car { Id = 7, Make = "Make7", Model = "Model5", Year = 2001, Price = 78355.92F }),
             new KeyValuePair<int, Car>(8, new Car { Id = 8, Make = "Make8", Model = "Model1", Year = 2011, Price = 1823223.23F })
         });

+        [Cache]
         public IEnumerable<Car> GetAll()
         {
             Thread.Sleep(500);
-- 
2.15.2 (Apple Git-101.1)

A couple of things we have done here:

In our entry point, we configured the cache backend we wanted to use, which in our case is the MemoryCache.
We marked the CarsContext.GetAll method with the CacheAttribute.

Believe it or not, this is pretty much it! When we run the same mini load test, you will see a dramatic difference, even though we still see a higher response time on the first load.

Again, very little work but tremendous gain in terms of value!

We have improved our performance drastically but have now introduced a very nasty problem: serving stale data. Thankfully, PostSharp has an out-of-the-box solution for cache invalidation which, for simple cases, doesn't make us lose our declarative approach. For this, we need to use the InvalidateCacheAttribute aspect. When this attribute is applied to a method, it causes any call to this method to remove from the cache the value of one or more other methods. It’s worth noting that the parameters of the cached methods are matched, by type and name, against the parameters of the invalidating method. PostSharp compilation takes care of the rest during the build step to set up all the invalidation logic.

For example, the below changes make it possible for us to invalidate the cache of a single car entity when it’s updated.

From f0889e68e55298e43360e01dd3b0e8b1cf6468e3 Mon Sep 17 00:00:00 2001
From: Tugberk Ugurlu
Date: Tue, 30 Apr 2019 09:40:21 +0100
Subject: [PATCH] cache invalidation, declarative

 postsharp/0-caching/1-sample-web/Startup.cs | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/postsharp/0-caching/1-sample-web/Startup.cs b/postsharp/0-caching/1-sample-web/Startup.cs
index bed37ca..ec95d1e 100644
--- a/postsharp/0-caching/1-sample-web/Startup.cs
+++ b/postsharp/0-caching/1-sample-web/Startup.cs
@@ -62,7 +62,7 @@ namespace _1_sample_web
         public IActionResult PutCar(int id, Car car) 
         {
             car.Id = id;
-            if (!_carsCtx.TryUpdate(car)) 
+            if (!_carsCtx.TryUpdate(id, car)) 
             {
                 return NotFound();
             }
@@ -130,6 +130,7 @@ namespace _1_sample_web
             return _database.Values.Where(predicate);
         }
 
+        [Cache]
         public Tuple<bool, Car> GetSingle(int id) 
         {
             Thread.Sleep(500);
@@ -166,7 +167,8 @@ namespace _1_sample_web
             return _database.TryRemove(id, out removedCar);
         }
 
-        public bool TryUpdate(Car car) 
+        [InvalidateCache(nameof(GetSingle))]
+        public bool TryUpdate(int id, Car car) 
         {
             Thread.Sleep(500);
 
-- 
2.20.1 (Apple Git-117)

However, this only invalidates the GetSingle method and we still have the problem of serving stale data from the GetAll method. There is also an out-of-the-box ability to imperatively invalidate an item from the cache, which is very handy for cases where we cannot invalidate the cache purely based on a method signature. You can see below an example of what this looks like.

From f629b295fc8f9bbd44904284cb0ec832d51185be Mon Sep 17 00:00:00 2001
From: Tugberk Ugurlu
Date: Tue, 30 Apr 2019 09:55:44 +0100
Subject: [PATCH] cache invalidation, imperatively

 postsharp/0-caching/1-sample-web/Startup.cs | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/postsharp/0-caching/1-sample-web/Startup.cs b/postsharp/0-caching/1-sample-web/Startup.cs
index ec95d1e..8ee6652 100644
--- a/postsharp/0-caching/1-sample-web/Startup.cs
+++ b/postsharp/0-caching/1-sample-web/Startup.cs
@@ -67,6 +67,10 @@ namespace _1_sample_web
                 return NotFound();
             }
 
+            CachingServices.Invalidation.Invalidate(
+                typeof(CarsContext).GetMethod(nameof(CarsContext.GetAll)), 
+                _carsCtx);
+                
             return Ok(car);
         }
 
-- 
2.20.1 (Apple Git-117)

We invalidate the GetAll method cache on the given CarsContext instance when we have an update on any of the items.

This is all I want to cover in this post in terms of the API surface area of PostSharp, and I hope this gives you a taste of how simple it is to get going with PostSharp. The PostSharp Caching documentation is also very comprehensive and I recommend you check it out for further details.

Limitations

The biggest limitation I have seen with PostSharp is its lack of .NET Core compilation support outside of Windows at the time of writing (you may check the current status here). You can run PostSharp on .NET Core, even outside of Windows. However, you first need a Windows machine to be able to compile your code.

Apart from this, there is also a trade-off for you to make with PostSharp, which is the increased build time. However, with incremental builds, this additional increase can become less noticeable. Besides, compared to the value you get from the tool, I think this is a trade-off well worth making.

Conclusion

This post just scratches the surface of what you can achieve with PostSharp. In terms of caching, for example, there is even support for Redis, which is very suitable for horizontally scaled web applications where multiple nodes serve HTTP requests.

PostSharp provides help on many other patterns such as multithreading. You can get started with PostSharp through PostSharp Essentials, the free but project-size-limited edition.

Software Architecture and System Design - Getting Your Grip and Some Related Resources

Tugberk Ugurlu — Sat, 23 Feb 2019 14:52:48 +0000

If you have never been exposed to software system design challenges, you might be totally lost on where to even begin. I believe in finding the limits to a certain extent first and then starting to get your hands dirty. The way you can start is by finding some interesting products or services (ideally ones you are a fan of) and learning about their implementations. You will be surprised that, however simple they may look, they most probably involve a great deal of complexity. Don’t forget: simple is usually complex and that’s OK™.

I believe the biggest suggestion I can give you while approaching system design challenges is this: do not assume anything! You should pin down the facts and the expectations from the system first. Here are some good questions to ask which will help you start this process:

What is the problem you are trying to solve?
What is the peak volume of users that will interact with your system?
What are the data write and read patterns going to be?
What are the expected failure cases, how do you plan to mitigate them?
What are the availability and consistency expectations?
Do you need to worry about any auditing, regulation aspects?
What type of sensitive data are you going to be storing?

These are just a few questions that have worked for me and the teams that I have worked with over the years. Once you have answers to these questions (or any others which are relevant to the context you are in), you can start diving into the technical side of the problem.

Setting Your Baseline

What do I mean by the baseline here? Well, in this era of software development, most of the problems "can" be solved by already existing techniques and technologies. Knowing these to a certain extent will give you a head start when you are faced with similar problems. Remember, we are writing software to solve our business' and our users' problems, and the desire is to do that in the most straightforward and simple way from a user experience point of view. Why do you need to remember this? Because you might feel that you should be solving problems in unique ways, thinking "what's the point of me writing software then if I am here to follow a pattern?". The craft here is in the decision making process of defining where to do what. Surely, we may face challenging, unique problems at certain times. However, if our baseline is solid, we will know whether we should direct our efforts into finding ways to solve the problem or into understanding its depth further.

I believe I have convinced you at this point now that having a solid knowledge on how some of the exciting systems are architecturally shaped is quite critical for you to progress on having some appreciation on the craft and a solid baseline.

OK, but where to start? Donne Martin has a GitHub repo called system-design-primer which helps you learn how to design large-scale systems and also prep for the system design interviews. Inside this, there is a section dedicated to real world architectures which also involves some system designs of well-known companies such as Twitter, Uber, etc.

However, before jumping into this, you might want to have some insights on what matters the most in architectural challenges. This is important because there are A LOT of aspects involved in disambiguating a gnarly, ambiguous problem and solving it within the guidelines of a defined system. Jackson Gabbard, an ex-Facebook employee, has a 50-minute video on system design interviews based on his experience of interviewing hundreds of candidates at Facebook. Even though it is focused on the system design interview and what success looks like there, it's still a very comprehensive resource on what matters the most when it comes to system design. There is also a write-up of this video.

Start Building up Your Data Storage and Retrieval Knowledge

Most of the time, the choice of how you decide to persist and serve data will play a crucial role on the performance of your system. Therefore, you should be able to understand the expectations around data writes and reads about your system first. Then, you should be able to assess these and convert that assessment into a choice. However, you can only do this effectively if you know the existing storage patterns. This essentially means having a good knowledge around database choices.

Databases are really scalable and durable data structures. So, all your knowledge around data structures should be really beneficial for understanding the various database choices. For example, Redis is a data structures server, supporting different kinds of values. It allows you to work with data structures such as sets and lists, and lets you read data through commonly-known algorithms such as LRU in a durable and highly available fashion.

Once you get enough grip around the various data storage patterns, it's time for you to get into data consistency and availability land. The CAP theorem is the first thing you should try to have a good grip of, and you can then polish that off by looking deeper into established consistency and availability patterns. These will give you a wide perspective and help you understand that data writes and reads are really very separate concerns with separate challenges associated with them. By embracing several consistency and availability patterns, you can gain a lot of performance while serving data to your applications.

Finally around data storage needs, you should also be aware of caching. Should it be both on the client and server? What data will you cache? And why? How will you invalidate the cache? (will it be based on time? If so, how long?). This section of system-design-primer should be a good starting point on this topic.
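To make the time-based invalidation question a bit more tangible, here is a minimal sketch of a cache where each entry expires after a fixed time to live. The TimedCache type and its members are hypothetical names made up for this example, the clock is injected so that expiry can be demonstrated without waiting, and the sketch is deliberately not thread-safe.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, single-threaded cache for illustration only.
public class TimedCache<TKey, TValue>
{
    // Each entry holds the value and the moment it expires.
    private readonly Dictionary<TKey, Tuple<TValue, DateTime>> _entries =
        new Dictionary<TKey, Tuple<TValue, DateTime>>();
    private readonly TimeSpan _timeToLive;
    private readonly Func<DateTime> _clock;

    public TimedCache(TimeSpan timeToLive, Func<DateTime> clock)
    {
        _timeToLive = timeToLive;
        _clock = clock;
    }

    public TValue GetOrAdd(TKey key, Func<TKey, TValue> load)
    {
        Tuple<TValue, DateTime> entry;
        if (_entries.TryGetValue(key, out entry) && entry.Item2 > _clock())
        {
            return entry.Item1; // still fresh, skip the expensive load
        }

        // Missing or expired: load again and remember the new expiry time.
        var value = load(key);
        _entries[key] = Tuple.Create(value, _clock() + _timeToLive);
        return value;
    }
}

public static class CacheDemo
{
    public static void Main()
    {
        var now = new DateTime(2019, 1, 1, 0, 0, 0, DateTimeKind.Utc);
        int loads = 0;
        var cache = new TimedCache<string, string>(TimeSpan.FromSeconds(30), () => now);
        Func<string, string> load = key => { loads++; return "value-for-" + key; };

        cache.GetOrAdd("a", load);  // loads from the source
        cache.GetOrAdd("a", load);  // served from the cache
        now = now.AddSeconds(31);   // move the fake clock past the expiry time
        cache.GetOrAdd("a", load);  // expired, so it loads again

        Console.WriteLine(loads); // 2
    }
}
```

The time to live is the knob behind the "how long?" question: a longer value saves more trips to the data store, at the cost of serving stale data for longer.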

Communication Patterns

Systems are composed of various components, which can be different processes living inside the same physical node or different machines sitting in separate parts of your network. Some of these resources might be private within your network, but some need to be accessed publicly by your consumers.

These resources need to be able to communicate with each other and with the outside world. In the context of system design, this introduces another set of unique challenges. Understanding how asynchronous workflows can help you, and what communication patterns are available, such as TCP, UDP, HTTP (which sits on top of TCP), etc., will help you understand the breadth of the problem space and the solutions currently available.

When dealing with communication to the outside world, security is always another side-effect that you need to be aware of and actively deal with.

Connection Distribution

I am not sure if this logical grouping makes sense here. I will go with it anyway since it’s the closest term that reflects what I want to cover here.

Systems are formed by gluing multiple components together, and how they communicate with each other often is designed through well-established protocols such as TCP and UDP. However, these protocols are often not enough on their own to cover the needs of today’s systems which can have high load and demands from our consumers. We often need ways to be able to distribute connections in order to handle the high load of our system.

Domain Name System (DNS) sits at the core of this distribution. A DNS translates a domain name such as www.example.com to an IP address. Besides this, some DNS services can route traffic through various methods such as weighted round robin and latency-based to help distribute the load.
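Weighted round robin can be sketched in a few lines. The version below is a deterministic illustration under my own assumptions: the WeightedRoundRobin type, the IP addresses and the weights are all made up, and real DNS services may well pick targets differently, for example by choosing randomly in proportion to the weights.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical router: each target appears in the schedule as many times as its weight.
public class WeightedRoundRobin
{
    private readonly List<string> _schedule = new List<string>();
    private int _next;

    public WeightedRoundRobin(IEnumerable<KeyValuePair<string, int>> weightedTargets)
    {
        foreach (var target in weightedTargets)
        {
            for (int i = 0; i < target.Value; i++)
            {
                _schedule.Add(target.Key);
            }
        }

        if (_schedule.Count == 0)
        {
            throw new ArgumentException("At least one target with a positive weight is required.");
        }
    }

    // Returns the next target and advances the cursor, wrapping around at the end.
    public string Next()
    {
        var target = _schedule[_next];
        _next = (_next + 1) % _schedule.Count;
        return target;
    }
}

public static class RoutingDemo
{
    public static void Main()
    {
        // Made-up addresses: the first one should receive three times the traffic.
        var router = new WeightedRoundRobin(new[]
        {
            new KeyValuePair<string, int>("10.0.0.1", 3),
            new KeyValuePair<string, int>("10.0.0.2", 1)
        });

        var counts = new Dictionary<string, int>();
        for (int i = 0; i < 8; i++)
        {
            var ip = router.Next();
            counts[ip] = counts.TryGetValue(ip, out var current) ? current + 1 : 1;
        }

        Console.WriteLine(counts["10.0.0.1"] + " vs " + counts["10.0.0.2"]); // 6 vs 2
    }
}
```

The same idea applies whether the targets are IP addresses returned for a domain name or instances sitting behind a load balancer.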

Load balancing is vital, and nearly every major system on the Web we interact with today sits behind one or multiple load balancers. Load balancers help us distribute incoming client requests to multiple instances of resources. There are both hardware and software forms of load balancers, but you will often see software-based ones such as HAProxy and ELB being used. Reverse proxies are also very similar to load balancers in concept, with some distinctive differences though. These differences will have an effect on your choice based on your needs.

Content Delivery Networks (CDN) are also something which you should be aware of. A CDN is a globally distributed network of proxy servers, serving content from locations closer to the user. CDNs are usually preferred when you are serving static files such as JavaScript, CSS and HTML. It’s also common to see cloud services offer traffic managers (such as Azure Traffic Manager) which give you global distribution and reduced latency benefits for your dynamic content. However, these services are mostly beneficial if you have stateless web services.

What About My Business Logic? Structuring Business Logic, Workflows and Components

Thus far, we talked about all the infrastructure related aspects of a system. These are the parts of your system which your users probably have no idea about and to be frank, they don't give a damn about them. What they care about is how they interact with your system, what they can achieve by doing so and how the system acts on behalf of them to make certain decisions and process their data.

As you might guess from this post’s title, I intended this blog post to be about software architecture and system design. Therefore, I wasn’t going to cover the software design patterns which are concerned with how the components are built. However, thinking about this more and more, it’s clear to me that the line between them is very blurred and the two sides are usually interconnected. Take Event Sourcing for example. Once you adopt this software architecture pattern, it pretty much affects most parts of your system: how you persist data, what level of consistency you choose for your system’s clients to deal with, how you shape the components within your system, and so on. Therefore, I decided to touch on some of the design and architectural patterns which directly concern your business logic. Even if it’s going to be just scratching the surface, it should be useful for you to have some ideas. Here are a few of them:

Collaboration Approaches

It's highly unlikely that you are going to be the only one involved in a project where you need to be part of a system design process. Therefore, you need to be able to collaborate with other folks in your team, both inside and outside of your job function. There is also a breadth and depth of this surface area and as the technical leader, you should be able to address the concerns on each level by going into it with a required depth. The activities here may involve evaluating technology choices together or pinning down the business needs and understanding how the work needs to be parallelised.

Photo by Kaleidico on Unsplash

First and foremost, you need to have an accurate and shared understanding of what you are trying to achieve as a business goal and which moving parts are involved. Group modeling techniques such as event storming are powerful methods to accelerate this process and increase your chances of success. You may get into this process before or after you define your service boundaries, depending on your product/service maturity stage. Based on the level of alignment you see here, you may want to facilitate a separate activity to define the Ubiquitous Language for the bounded context you are operating in. When it comes to communicating the architecture of your system, you may find the C4 model for software architecture from Simon Brown useful, especially when it comes to understanding what level of depth you should go into while visualising what you are trying to convey.

There are most probably other mature techniques available in this space. However, all will tie back to your domain understanding and your experience and knowledge around Domain-driven Design will prove to be handy.

Some Other Resources

Here are some resources which may help you. These are not in any particular order.

Pulling an Old Article From the Coffin: SignalR with Redis Running on a Windows Azure Virtual Machine

Tugberk Ugurlu — Wed, 08 Aug 2018 14:32:00 +0000

A long time ago (about 5 years, at least), I contributed an article to the SignalR wiki about scaling a SignalR application with Redis. You can still find the article here. I also blogged about it here. However, over time, the pictures got lost there. I got a few requests from my readers to refresh those images and I was lucky enough to be able to find them :) I decided to publish that article here so that I would have much better control over the content. So, here is the post :)

Please keep in mind that this is a really old post and lots of things have evolved since then. However, I do believe the concepts still resonate and it’s valuable to show the ways of how to achieve this within a cloud provider’s context.

SignalR with Redis Running on a Windows Azure Virtual Machine

This wiki article will walk you through how you can run your SignalR application on multiple machines with Redis as your backplane, using Windows Azure Virtual Machines for scale-out scenarios.

Creating the Windows Azure Virtual Machines

First of all, we will spin up our virtual machines. What we want here is to have two Windows Server 2008 R2 virtual machines for our SignalR application and we will name them as Web1-08R2 and Web2-08R2. We will have the IIS installed on both of these servers and at the end, we will load balance the request on port 80.

Our third virtual machine will be another Windows Server 2008 R2 only for our Redis server. We will call this server Redis-08R2.

To spin up the VMs, go to the new Windows Azure Management Portal and hit the New icon at the bottom-right corner.

Creating a virtual machine running Windows Server 2008 R2 is explained here in detail. We followed the same steps to create our first VM named Web1-08R2.

The second VM we will be creating takes a slightly different approach than the first one. Under the hood, every virtual machine is a cloud service instance and we want to put our second VM (Web2-08R2) under the same cloud service that our first web VM is running under. To do that, we need to follow the same steps as explained in the previously mentioned article, but when we come to the 3rd step in the creation wizard, we should choose the Connect to existing Virtual Machine option this time and select the first VM we have just created.

As the last step, we now need to create our redis VM which will be named Redis-08R2. We will follow the same steps as we did when we were creating our second web VM (Web2-08R2).

Setting Up Redis as a Windows Service

To use Redis on a Windows machine, we went to Redis on Windows prototype GitHub page and cloned the repository and followed the steps explained under How to build Redis using Visual Studio section.

After you build the project, you will have all the files you need under msvs\bin\release path as zip files. redisbin.zip file will contain the redis server, redis command line interface and some other stuff. rediswatcherbin.zip file will contain the msi file to install redis as a windows service. You can just copy those zip files to your Redis VM and extract redisbin.zip under c:\redis\bin. Then follow the steps:

Currently, there is a bug in the RedisWatcher installer and if you don't have Microsoft Visual C++ 2010 Redistributable Package installed on your machine, the service won't start. So, I installed it first.
Copy this redis.conf file and put it under c:\redis\bin directory. Open it up and add a password by adding the following line of code:
requirepass 1234567
Take this note into consideration when you are setting up your redis password:
Warning: since Redis is pretty fast an outside user can try up to 150k passwords per second against a good box. This means that you should use a very strong password otherwise it will be very easy to break.
Then, extract the rediswatcherbin.zip somewhere and run InstallWatcher.msi to install the service.
Navigate to C:\Program Files (x86)\RedisWatcher directory. You will see a file named watcher.conf inside this directory. Open this file up and replace the entire file with the following text. Only difference here is that we are supplying the redis.conf file directory for the server to use:
```
exepath c:\redis\bin
exename redis-server.exe

{
 workingdir c:\redis\inst1
 runmode hidden
 saveout 1
 cmdparms c:\redis\bin\redis.conf
}
```
Create a folder named inst1 under c:\redis because we have specified this folder as working directory for our redis instance.
When you do a search against windows services in PowerShell, you will see RedisWatcherSvc service is installed.

Run the following PowerShell command to start the service for the first time.
```
(Get-Service -Name RedisWatcherSvc).Start()
```

Now we have a Redis server running on our VM. To test if it is actually running, open up a windows command window under c:\redis\bin and run the following command (assuming you set your password 1234567):

redis-cli -h localhost -p 6379 -a 1234567

Now, you have a redis client running.

Ping the redis to see if you are really authenticated:

Now, we are nearly set. As a last step in our redis server, we need to open up TCP port 6379 for external communication. You can do this under Windows Firewall with Advanced Security window as explained here.

Communicating Through Internal Endpoints Between Windows Azure Virtual Machines Under Same Cloud Service

When you are inside one of your web VMs, you can simply look up the redis VM by hostname.

The hostname will resolve to DIP (Dynamic IP Address) which Windows Azure will use internally. We can configure public endpoints through Windows Azure Management Portal easily but in that case, we would be opening redis to the whole world. Also, if we communicate to our redis server through VIP (Virtual IP Address), we would always go through the load balancer which has its own additional cost.

So, we can easily connect to our redis server from any other connected VM by hostname.
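
If you want to check that connectivity from one of the web VMs without the redis-cli, the handshake is simple enough to sketch by hand. The Python snippet below is a rough sketch under my own assumptions: it encodes commands in the Redis wire format (RESP, an array of bulk strings) and sends AUTH followed by PING over a plain TCP socket. The function names are hypothetical; the hostname, port and password in the usage note are the ones from this walkthrough.

```python
import socket
from typing import List


def encode_command(*parts: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    chunks = [f"*{len(parts)}\r\n".encode()]
    for part in parts:
        data = part.encode()
        chunks.append(f"${len(data)}\r\n".encode() + data + b"\r\n")
    return b"".join(chunks)


def auth_and_ping(host: str, port: int, password: str, timeout: float = 3.0) -> List[str]:
    """Send AUTH and PING, and return the two single-line replies from the server."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(encode_command("AUTH", password) + encode_command("PING"))
        reader = conn.makefile("rb")
        # Both replies are simple one-line responses (status or error).
        return [reader.readline().decode().strip() for _ in range(2)]


print(encode_command("PING"))
```

Calling auth_and_ping("Redis-08R2", 6379, "1234567") from a web VM should come back with an OK status for AUTH and a PONG for PING if the server is reachable and the password is right. I have not wired this into the sample application; it is only meant as a quick connectivity check.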

The SignalR Application with Redis

Our SignalR application will not be that much different from a normal SignalR application thanks to SignalR.Redis project. All you need to do is to add the SignalR.Redis nuget package into your application and configure SignalR to use Redis as the message bus inside the Application_Start method in Global.asax.cs file:

protected void Application_Start(object sender, EventArgs e)
{
    // Hook up redis
    string server = ConfigurationManager.AppSettings["redis.server"];
    string port = ConfigurationManager.AppSettings["redis.port"];
    string password = ConfigurationManager.AppSettings["redis.password"];

    GlobalHost.DependencyResolver.UseRedis(server, Int32.Parse(port), password, "SignalR.Redis.Sample");
}

For our demo, the AppSettings should look like the following:

<appSettings>
    <add key="redis.server" value="Redis-08R2" />
    <add key="redis.port" value="6379" />
    <add key="redis.password" value="1234567" />
</appSettings>

I put the application under IIS on both of our web servers (Web1-08R2 and Web2-08R2) and configured them to run under a .NET Framework 4.0 integrated application pool.

For this demo, I am using the Redis.Sample chat application included inside the SignalR.Redis project.

Let's test them quickly before going public. I fired up both web applications inside the servers and here is the result:

Perfectly running! Let's open them up to the world.

Opening up Port 80 and Load Balancing the Requests

Our requirement here is to make our application reachable over HTTP and, at the same time, to load balance the requests between our two web servers.

To do that, we need to go to Windows Azure Management portal and set up the TCP endpoints for port 80.

First, we navigate to dashboard of our Web1-08R2 VM and hit Endpoints from the dashboard menu:

From there, hit the Add Endpoint icon at the bottom of the page:

A wizard is going to appear on the screen:

Click the right-arrow icon and go to next step which is the last one and we will enter the port details there:

After that, our endpoint will be created:

Follow the same steps for the Web2-08R2 VM as well and open the Add Endpoint wizard. This time, we will be able to select Load-balance traffic on an existing port. Choose the previously created port and continue:

At the last step, enter the proper details and hit save:

We will see our new endpoint being created, but this time the Load Balanced column indicates Yes.

As we configured our web applications without a host name and they are exposed through port 80, we can directly reach our application through the URL or Public Virtual IP Address (VIP) which is provided to us. When we run our application, we should see it running as below:

No matter which server a request goes to, the message will be broadcast to every client because we are using Redis as a message bus.
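
To show why the backplane makes this work, here is a small in-memory Python sketch of the idea. It is a simulation under my own assumptions and not how SignalR.Redis is implemented: MessageBus stands in for the Redis pub/sub channel, WebServer stands in for one of our web VMs, and every class and method name here is made up for illustration.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List


class MessageBus:
    """In-memory stand-in for the Redis pub/sub channel."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Callable[[str], None]) -> None:
        self._subscribers[channel].append(handler)

    def publish(self, channel: str, message: str) -> None:
        for handler in self._subscribers[channel]:
            handler(message)


class WebServer:
    """One web server instance with its own locally connected clients."""

    def __init__(self, name: str, bus: MessageBus, channel: str) -> None:
        self.name = name
        self.clients: Dict[str, List[str]] = {}  # client id -> messages received
        self._bus = bus
        self._channel = channel
        bus.subscribe(channel, self._deliver_to_local_clients)

    def connect(self, client_id: str) -> None:
        self.clients[client_id] = []

    def send(self, message: str) -> None:
        # A server never delivers directly; it always publishes to the bus.
        self._bus.publish(self._channel, message)

    def _deliver_to_local_clients(self, message: str) -> None:
        for inbox in self.clients.values():
            inbox.append(message)


bus = MessageBus()
web1 = WebServer("Web1-08R2", bus, "SignalR.Redis.Sample")
web2 = WebServer("Web2-08R2", bus, "SignalR.Redis.Sample")
web1.connect("alice")
web2.connect("bob")
web1.send("hello")  # the load balancer happened to pick Web1 for this request
print(web1.clients, web2.clients)
```

The important bit is that send never touches the local clients directly. Every server, including the one that received the request, only delivers what comes back from the bus.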

Graph Depth-First Search (DFS)

Tugberk Ugurlu — Sat, 28 Jul 2018 12:32:00 +0000

A while ago, I wrote about Graphs and gave a few examples of their application to real world problems. I absolutely love graphs, as they are a powerful way to model the data for several key computer science problems. In this post, I want to talk about one of the most common graph algorithms, Depth-first search (DFS), and how and where it could be useful.

What is Depth-First Search (DFS)?

DFS is a specific algorithm for traversing and searching a graph data structure. Depending on the type of graph, the algorithm might differ. However, the idea is actually quite simple for a Directed Acyclic Graph (DAG):

You start with a source vertex (let's call it "S")
You visit the first neighbour vertex of that node (let's call this "N")
You do the same for "N" and you keep going till you end up at a leaf vertex (L) (which is a vertex that has no edges to another vertex)
Then you visit the second neighbour of L's parent vertex.
You would be done once you exhaust all the vertices.

I must admit that this is a bit simplified version of the algorithm even for a DAG. For instance, we didn't touch on the fact that we might end up actually visiting the same vertex multiple times if we don't take this into account in our algorithm. There is a really good visualization of this algorithm here where you can observe how the algorithm works in a visual way through a logical graph representation.
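
The steps above, including the bookkeeping for already visited vertices, can be sketched in a few lines of Python. This is a minimal recursive sketch, assuming the graph is stored as an adjacency list (a dict mapping each vertex to its neighbours); the function name and the example graph are made up for illustration.

```python
from typing import Dict, List


def depth_first_search(graph: Dict[str, List[str]], source: str) -> List[str]:
    """Return the vertices reachable from `source` in the order DFS visits them."""
    order: List[str] = []
    seen = set()

    def visit(vertex: str) -> None:
        if vertex in seen:
            return  # already visited through another path
        seen.add(vertex)
        order.append(vertex)
        for neighbour in graph.get(vertex, []):
            visit(neighbour)

    visit(source)
    return order


# "L" is reachable through both "N" and "M", but it is only visited once.
example_graph = {"S": ["N", "M"], "N": ["L"], "M": ["L"], "L": []}
dfs_order = depth_first_search(example_graph, "S")
print(dfs_order)
```

Recursion keeps the sketch close to the description above. For very deep graphs you would likely switch to an explicit stack to avoid hitting the recursion limit.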

Application of Depth-First Search

There are various applications of DFS which are used to solve particular problems, such as Topological Sorting and detecting a cycle in a graph. There are also occasions where DFS is used as part of another known algorithm to solve a real world problem. One example of that is Tarjan’s Algorithm to find Strongly Connected Components.
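
As an illustration of the first two applications, here is a small Python sketch of a DFS-based topological sort which also reports a cycle when it finds one. It assumes a directed graph stored as an adjacency list, where an edge from A to B means "A must come before B". The function name and the build step names are hypothetical.

```python
from typing import Dict, List

UNVISITED, IN_PROGRESS, DONE = 0, 1, 2


def topological_sort(graph: Dict[str, List[str]]) -> List[str]:
    """Order the vertices so that every edge points forward; raise ValueError on a cycle."""
    state: Dict[str, int] = {}
    finished: List[str] = []

    def visit(vertex: str) -> None:
        current = state.get(vertex, UNVISITED)
        if current == IN_PROGRESS:
            # We came back to a vertex that is still on the current DFS path.
            raise ValueError(f"cycle detected at {vertex!r}")
        if current == DONE:
            return
        state[vertex] = IN_PROGRESS
        for neighbour in graph.get(vertex, []):
            visit(neighbour)
        state[vertex] = DONE
        finished.append(vertex)

    for vertex in graph:
        visit(vertex)
    # Vertices finish in reverse topological order, so flip the list.
    return finished[::-1]


build_steps = {"compile": ["test"], "test": ["package"], "lint": ["package"], "package": []}
build_order = topological_sort(build_steps)
print(build_order)
```

The three states are what make cycle detection fall out for free: a vertex which is still in progress when we reach it again must be its own ancestor.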

This is also a good resource which lists out different real world applications of DFS.

Other Graph Traversal Algorithms

As you might guess, DFS is not the only known algorithm for traversing a graph data structure. Breadth-First Search (BFS) is another well-known graph traversal algorithm with similar semantics to DFS, but instead of going deep on one vertex, it prefers to visit all the neighbors of the current vertex first. Bidirectional search is another traversal algorithm, mainly used to find a shortest path from an initial vertex to a goal vertex in a directed graph.
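
For comparison with DFS, here is a minimal Python sketch of BFS, again assuming an adjacency list representation. The only structural difference from a stack-based DFS is that the frontier is a queue. The names and the example graph are made up for illustration.

```python
from collections import deque
from typing import Dict, List


def breadth_first_search(graph: Dict[str, List[str]], source: str) -> List[str]:
    """Return the vertices reachable from `source`, nearest ones first."""
    order = [source]
    seen = {source}
    queue = deque([source])
    while queue:
        vertex = queue.popleft()  # oldest discovered vertex first
        for neighbour in graph.get(vertex, []):
            if neighbour not in seen:
                seen.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
    return order


sample_graph = {"S": ["N", "M"], "N": ["L"], "M": ["L"], "L": []}
bfs_order = breadth_first_search(sample_graph, "S")
print(bfs_order)
```

On this graph, BFS reaches both direct neighbours of "S" before it gets to "L", whereas DFS would go down to "L" through the first neighbour before looking at the second one.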

Setting up a MongoDB Replica Set with Docker and Connecting to It With a .NET Core App

Tugberk Ugurlu — Wed, 31 Jan 2018 10:10:00 +0000

Easily setting up realistic non-production (e.g. dev, test, QA, etc.) environments is really critical in order to reduce the feedback loop. In this blog post, I want to talk about how you can achieve this if your application relies on MongoDB Replica Set by showing you how to set it up with Docker for non-production environments.

Hold on! I want to watch, not read!

I got you covered there! I have also recorded a ~5m video covering the content of this blog post, where I also walk you through the steps visually. If you find this option useful, let me know through the comments below and I can aim harder to repeat that :)

What are we trying to do here and why?

If you have an application which works against a MongoDB database, it’s very common to have a replica set in production. This approach ensures the high availability of the data, especially for read scenarios. However, applications mostly end up working against a single MongoDB instance, because setting up a Replica Set in isolation is a tedious process. As mentioned at the beginning of the post, we want to reflect the production environment to the process of developing or testing the software applications as much as possible. The reason for that is to catch unexpected behaviour which may only occur under a production environment. This approach is valuable because it would allow us to reduce the feedback loop on those exceptional cases.

Docker makes this all easy!

This is where Docker enters the picture! Docker is a containerization technology and it allows us to have a repeatable process to provision environments in a declarative way. It also gives us a try and tear down model where we can experiment and easily start again from the initial state. Docker can also help us with easily setting up a MongoDB Replica Set. Within our Docker Host, we can create a Docker Network which gives us isolated DNS resolution across containers. Then we can start creating the MongoDB docker containers. They would initially be unaware of each other. However, we can initialise the replication by connecting to one of the containers and running the replica set initialisation command. Finally, we can deploy our application container under the same docker network.

There are a handful of advantages to setting up this with Docker and I want to specifically touch on some of them:

It can be automated easily. This is especially crucial for test environments which are provisioned on demand.
It’s repeatable! The declarative nature of the Dockerfile makes it possible to end up with the same environment setup even if you run the scripts months later after your initial setup.
Familiarity! Docker is a widely known and used tool for lots of other purposes and familiarity with the tool is high. Of course, this may depend on your development environment.

Let’s make it work!

First of all, I need to create a docker network. I can achieve this by running the “docker network create” command and giving it a unique name.

docker network create my-mongo-cluster

The next step is to create the MongoDB docker containers and start them. I can use the “docker run” command for this. Also, MongoDB has an official image on Docker Hub. So, I can reuse that to simplify the acquisition of MongoDB. For convenience, I will name the containers with a number suffix. Each container also needs to be tied to the network we have previously created. Finally, I need to specify the name of the replica set for each container.

docker run --name mongo-node1 -d --net my-mongo-cluster mongo --replSet "rs0"

The first container is created and I need to run the same command to create two more MongoDB containers. The only difference is the container names.

docker run --name mongo-node2 -d --net my-mongo-cluster mongo --replSet "rs0"
docker run --name mongo-node3 -d --net my-mongo-cluster mongo --replSet "rs0"

I can see that all of my MongoDB containers are at the running state by executing the “docker ps” command.

In order to form a replica set, I need to initialise the replication. I will do that by connecting to one of the containers through the “docker exec” command and starting the mongo shell client.

docker exec -it mongo-node1 mongo

As I now have a connection to the server, I can initialise the replication. This requires me to declare a config object which will include connection details of all the servers.

config = {
      "_id" : "rs0",
      "members" : [
          {
              "_id" : 0,
              "host" : "mongo-node1:27017"
          },
          {
              "_id" : 1,
              "host" : "mongo-node2:27017"
          },
          {
              "_id" : 2,
              "host" : "mongo-node3:27017"
          }
      ]
  }

Finally, we can run the “rs.initiate(config)” command to complete the set up.

You will notice that the server I am connected to will be elected as the primary in the replica set shortly. By running “rs.status()”, I can view the status of other MongoDB servers within the replica set. We can see that there are two secondaries and one primary in the replica set.
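
If you provision these environments on demand, typing the config object by hand gets old quickly. Here is a small Python sketch which generates the same config object from a list of container names and prints a line you could paste into the mongo shell. The helper name is my own invention; the replica set name, host names and port are the ones used above, and rs.initiate is the mongo shell helper which initialises a replica set.

```python
import json
from typing import Any, Dict, List


def build_replica_set_config(name: str, hosts: List[str], port: int = 27017) -> Dict[str, Any]:
    """Build the document rs.initiate() expects: a set id plus numbered members."""
    return {
        "_id": name,
        "members": [
            {"_id": index, "host": f"{host}:{port}"}
            for index, host in enumerate(hosts)
        ],
    }


replica_config = build_replica_set_config("rs0", ["mongo-node1", "mongo-node2", "mongo-node3"])
# Emit a command which can be pasted into the mongo shell.
print(f"rs.initiate({json.dumps(replica_config)})")
```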

.NET Core Application

As a scenario, I want to run my .NET Core application which writes data to a MongoDB database and then reads it back in a loop. This application will be connecting to the MongoDB replica set which we have just created. This is a standard .NET Core console application which you can create by running the following command:

dotnet new console

The csproj file for this application looks like the following:

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>netcoreapp2.0</TargetFramework>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="Bogus" Version="18.0.2" />
    <PackageReference Include="MongoDB.Driver" Version="2.4.4" />
    <PackageReference Include="Polly" Version="5.3.1" />
  </ItemGroup>
</Project>

Notice that I have two interesting dependencies there. Polly is used to retry the read calls to MongoDB based on defined policies. This bit is interesting as I would expect the MongoDB client to handle that for read calls. However, it might be also a good way of explicitly stating which calls can be retried inside your application. Bogus, on the other hand, is just here to be able to create fake names to make the application a bit more realistic :)

Finally, this is the code to make this application work:

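using System;
using System.Threading;
using MongoDB.Bson;
using MongoDB.Driver;
using MongoDB.Driver.Core.Clusters;

// The User type (with Id and Name properties) is assumed to be defined elsewhere in the project.
partial class Program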
{
    static void Main(string[] args)
    {
        var settings = new MongoClientSettings
        {
            Servers = new[]
            {
                new MongoServerAddress("mongo-node1", 27017),
                new MongoServerAddress("mongo-node2", 27017),
                new MongoServerAddress("mongo-node3", 27017)
            },
            ConnectionMode = ConnectionMode.ReplicaSet,
            ReplicaSetName = "rs0"
        };

        var client = new MongoClient(settings);
        var database = client.GetDatabase("mydatabase");
        var collection = database.GetCollection<User>("users");

        System.Console.WriteLine("Cluster Id: {0}", client.Cluster.ClusterId);
        client.Cluster.DescriptionChanged += (object sender, ClusterDescriptionChangedEventArgs foo) => 
        {
            System.Console.WriteLine("New Cluster Id: {0}", foo.NewClusterDescription.ClusterId);
        };

        for (int i = 0; i < 100; i++)
        {
            var user = new User { Id = ObjectId.GenerateNewId(), Name = new Bogus.Faker().Name.FullName() };
            collection.InsertOne(user);
        }

        while (true)
        {
            var randomUser = collection.GetRandom();
            Console.WriteLine(randomUser.Name);

            Thread.Sleep(500);
        }
    }
}

This is not the most beautiful and optimized code ever but should demonstrate what we are trying to achieve by having a replica set. It's actually the GetRandom method on the MongoDB collection object which handles the retry:

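// Usings needed if this class lives in its own file.
using System;
using MongoDB.Driver;
using Polly;

public static class CollectionExtensions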
{
    private readonly static Random random = new Random();

    public static T GetRandom<T>(this IMongoCollection<T> collection) 
    {
        var retryPolicy = Policy
            .Handle<MongoCommandException>()
            .Or<MongoConnectionException>()
            .WaitAndRetry(2, retryAttempt => 
                TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)) 
            );

        return retryPolicy.Execute(() => GetRandomImpl(collection));
    }

    private static T GetRandomImpl<T>(this IMongoCollection<T> collection)
    {
        return collection.Find(FilterDefinition<T>.Empty)
            .Limit(-1)
            .Skip(random.Next(99))
            .First();
    }
}
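
To make the retry behaviour of that policy easier to see without running the whole thing, here is a rough Python sketch of the same wait-and-retry idea: one initial attempt, then up to two retries, waiting 2 and then 4 seconds in between. It is not Polly and it does not talk to MongoDB. The function name is made up, ConnectionError stands in for the MongoDB exceptions, and the sleep function is injectable so the demo does not actually wait.

```python
import time
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


def wait_and_retry(
    action: Callable[[], T],
    retry_count: int = 2,
    handled: Tuple[Type[BaseException], ...] = (ConnectionError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `action`; on a handled error wait 2**attempt seconds and try again."""
    attempt = 0
    while True:
        try:
            return action()
        except handled:
            attempt += 1
            if attempt > retry_count:
                raise  # retries exhausted, surface the last error
            sleep(2 ** attempt)


calls = {"count": 0}


def flaky_read() -> str:
    """Fail twice, as if the primary just went away, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("primary is unavailable")
    return "Jane Doe"


waits: List[float] = []
result = wait_and_retry(flaky_read, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)
```

The waits add up to six seconds at most, which gives the replica set some room to elect a new primary before the application gives up.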

I will run this through Docker as well and here is the Dockerfile for this:

FROM microsoft/dotnet:2-sdk

COPY ./mongodb-replica-set.csproj /app/
WORKDIR /app/
RUN dotnet --info
RUN dotnet restore
ADD ./ /app/
RUN dotnet publish -c DEBUG -o out
ENTRYPOINT ["dotnet", "out/mongodb-replica-set.dll"]

When it starts, we can see that it will output the result to the console:

Prove that It Works!

In order to demonstrate the effect of the replica set, I want to take down the primary node. First of all, we need to have a look at the output of the rs.status command we ran previously in order to identify the primary node. We can see that it’s node1!

Secondly, we need to get the container id for that node.

Finally, we can kill the container by running the “docker stop” command. Once the container is stopped, you will notice that the application will gracefully recover and continue reading the data.

Speaking at SQL in the City 2017, Register Now!

Tugberk Ugurlu — Tue, 05 Dec 2017 18:27:00 +0000

I'm quite happy to tell you that I'll be speaking at SQL in the City 2017 on the 13th of December about the latest SQL Compare features and support for SQL Server 2017 with my colleague and fellow MVP, Steve Jones.

SQL in the City is Redgate's annual virtual event and this year's livestream event focuses on enabling you to be more productive. Technical sessions will dive into the latest Microsoft SQL Server releases, and cover topical issues such as data compliance, protection & privacy.

This year's agenda is full of really great sessions from enabling DevOps for databases by automating your deployments to rapid (and magic!) database provisioning with SQL Clone. This year Data Platform MVPs Steve Jones and Grant Fritchey will be joined by Kathi Kellenberger, Editor of Simple Talk and many more who are behind the great tools we build!

Register Now!

Register now to confirm your attendance, and be the first to get access to Grant Fritchey’s new eBook, SQL Server Execution Plans, when it's released in 2018.

Understanding Graphs and Their Application on Software Systems

Tugberk Ugurlu — Tue, 19 Sep 2017 17:21:00 +0000

Lately, I wanted to spend a little bit of time going back to fundamental computer science concepts. Hopefully, I will be able to write about these while I am looking into them in order to offload the knowledge from my brain to the magic hands of the Web :) I am going to start with Graphs, specifically Depth First Traversal (a.k.a. Depth First Search or DFS) and Breadth First Traversal (a.k.a. Breadth First Search or BFS). However, this post is only about the definition of a Graph and its application in software systems.

What is a Graph?

I am sure you are capable of Googling what a Graph is and ironically maybe that’s why you are reading this sentence now. However, I am not going to put the fancy explanation of a Graph here. Wikipedia already has a great definition on a Graph which can be useful to start with.

Let’s start with a picture:

This is a graph and there are some unique characteristics of this which makes it a graph.

Vertices (a.k.a. Nodes): Each circle with a label inside the above picture is called a vertex or node. They are fundamental building blocks of a graph.
Edges (a.k.a. Arc, Line, Link, Branch): A line that joins two vertices together is called an edge. An edge can take two forms: undirected and directed. We will get to what these actually mean.

At this point you might be asking what the difference between a graph and a tree is. A tree is actually a graph with some special constraints applied to it. A few of these that I know:

A tree cannot contain a cycle but a graph can (see the A, B and E nodes and their edges inside the above picture).
A tree always has a specific root node, whereas you don’t have this concept with a graph.
A tree can only have one edge between two of its nodes, whereas we can have unidirectional and bidirectional edges between nodes within a graph.

I am sure there are more but I believe these are the ones that matter the most.

As we can see with the tree example, graphs come in many forms. There are many types of graphs and each type has its own unique characteristics and real world use cases. Undirected and directed graphs are two of these types, as I briefly mentioned while explaining the edges. I believe the best example to describe the difference between them is to have a look at the fundamental concept of Facebook and Twitter.
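
To put that comparison into code: a Facebook friendship is mutual, so it behaves like an undirected edge, whereas following someone on Twitter is one-way, so it behaves like a directed edge. Here is a minimal Python sketch of both, using adjacency sets. The class names and the people are made up, and this says nothing about how either company actually stores its data.

```python
from collections import defaultdict
from typing import DefaultDict, Set


class DirectedGraph:
    """An edge from a to b says nothing about b to a (think: a follows b)."""

    def __init__(self) -> None:
        self._adjacent: DefaultDict[str, Set[str]] = defaultdict(set)

    def add_edge(self, source: str, target: str) -> None:
        self._adjacent[source].add(target)

    def has_edge(self, source: str, target: str) -> bool:
        return target in self._adjacent.get(source, set())


class UndirectedGraph(DirectedGraph):
    """Every edge is stored in both directions (think: a and b are friends)."""

    def add_edge(self, source: str, target: str) -> None:
        super().add_edge(source, target)
        super().add_edge(target, source)


friendships = UndirectedGraph()
friendships.add_edge("alice", "bob")

follows = DirectedGraph()
follows.add_edge("alice", "bob")

print(friendships.has_edge("bob", "alice"), follows.has_edge("bob", "alice"))
```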

Application of Graphs

Graphs are amazing, I absolutely love the concept of a graph! Everyone interacts every day with a system which somehow makes use of graphs. Facebook, Google Maps, Foursquare and the fraud check system that your bank applies all make use of a graph, and there are many, many more. One application of the graph concept which I love is a recommendation engine. There are many forms of this but a very basic one is called Collaborative Filtering. At its core, it works under a notion of “Steve and Mark liked BMW, Mercedes and Toyota, you like BMW and Toyota, so you may like Mercedes, too?”.
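
That notion can be sketched in a few lines of Python. This is a deliberately naive take on user-based collaborative filtering, assuming the likes are kept as a mapping from a person to the set of things they like (effectively a bipartite graph): items liked by people who share likes with you are scored by the size of the overlap. The function name is hypothetical and real recommendation engines are far more involved.

```python
from collections import Counter
from typing import Dict, List, Set


def recommend(likes: Dict[str, Set[str]], user: str) -> List[str]:
    """Suggest items the user has not liked yet, best scored first."""
    mine = likes[user]
    scores: Counter = Counter()
    for other, theirs in likes.items():
        if other == user:
            continue
        overlap = len(mine & theirs)  # how similar is this person to the user?
        if overlap == 0:
            continue
        for item in theirs - mine:
            scores[item] += overlap
    return [item for item, _ in scores.most_common()]


car_likes = {
    "Steve": {"BMW", "Mercedes", "Toyota"},
    "Mark": {"BMW", "Mercedes", "Toyota"},
    "You": {"BMW", "Toyota"},
}
suggestions = recommend(car_likes, "You")
print(suggestions)
```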

There are some really good graph databases with their own query languages as well. One that I love is Neo4j, which uses the Cypher query language to make its data available to be consumed. On their web site, there are a few key applications of Neo4j listed and they are fundamentally real world applications of the graph concept.

You can also come across some interesting problems in the space of mathematics which have solutions based on a type of graph, like the Seven Bridges of Königsberg problem (and I think this problem is the cornerstone in the history of graph theory).

Here is My Proudest Achievement, What is Yours?

Tugberk Ugurlu — Mon, 18 Sep 2017 16:59:00 +0000

It's very common that you get asked about your proudest achievement. I wanted to put mine here publicly so that I would have a place to direct people to. So, here it is :)

My proudest achievement to this day dates back to 2010. I was working at a local Travel Agency in Turkey while still studying Travel Management at the university, and we had a Web site for our customers to book their airport transfers from/to their hotels by paying online. However, the application didn't allow our customers to book additional services at extra cost, such as a baby booster seat. In addition to this, we were unable to reflect our pricing accurately for particular conditions due to the limitations of the system. At the time, I was working in the reservations and booking department but I had a huge interest in software development, especially in web applications.

When our web developer left the company, I prototyped the algorithm to calculate the airport transfer pricing on SQL Server based on the number of passengers, arrival and departure dates. I presented this to my manager, who was also in charge of the company's online sales, and asked for a budget and time for developing a new airport transfer booking system for the company. I explained that this system would have the same features as the old system along with the additional features we had always wanted. This was going to allow us to provide better service to our customers and reflect cheaper prices by having a maintainable system to build upon. My manager believed in me, and gave me time and budget to invest in this. I spent a month developing the system and its content management system by coding the business logic in C# (while learning it at the same time), developing the user interface as a web application using HTML, CSS and JavaScript, and integrating it with an online payment system. I had to deal with lots of things I hadn't known about, but having good support from my manager made me always trust myself and keep pushing to get through all the obstacles. We rolled out the system under a different domain name first and advertised it through Google AdWords (you can see that version here, even if the styles and functionality don't quite work on web.archive.org). Within the first 5 hours, we sold a Private Minibus Transfer through the system. Over the weeks, we directed all our transfer booking channels to this new system and kept evolving it. After two years, the system sold 32% more transfers than the old system and yielded 26% more revenue. The final look of the system is still running here and maintained by the company (I should point out that I am not entirely responsible for the new look of the site, especially for those red primary action buttons!).
Proving myself with this achievement also gave me a chance to take more responsibility for software engineering at the company and I was able to get a budget for an accommodation booking system (it can be found on web.archive.org) which yielded extra revenue for the company for a few years.

Lots of things happened after this and I achieved so much more, such as being part of several successful teams creating valuable software products, being published, receiving the Microsoft MVP award 5 years in a row, speaking at lots of international conferences, maintaining a successful blog for 7+ years and many more. However, nothing was able to beat that achievement because it was a unique opportunity to fight for something I truly believed in. Besides that, having a true leader as your manager is a unique opportunity. He trusted me and my skills, and looking back at this now, it's very clear that I would never have become a good software developer without this trust and my confidence in myself.

What's Your Proudest Achievement?

Well, it's your turn. Hopefully I encouraged you to share yours publicly as well. Please share yours as a comment here, preferably by linking to your blog post which you are about to write :)

Defining What Good Looks Like for a Software Engineer

Tugberk Ugurlu — Sat, 18 Mar 2017 15:00:00 +0000

If you are a software engineer, what good looks like is a question you will ask yourself a lot. It comes up especially often if you are part of the recruitment process in your company. As you may know, I work at Redgate, and we have a good culture for development teams. Besides that, the common characteristics of a good engineer are defined there, too, with examples and counter examples for each engineering role. This is really good guidance for the employer to reflect their culture for a particular role. It’s also good for the employees to understand where they stand on being an effective employee.

(Image is from https://commons.wikimedia.org/wiki/File:Coding_Shots_Annual_Plan_high_res-5.jpg)

I got inspired by this and I wanted to share the list of principles I value and look for in a software engineer. Obvious disclaimer: this is not the list of principles that my employer values, even if most of them are pretty similar. Now that we have got the disclaimer out of our way, let's see these principles:

Knows the fundamental concepts, data structures and common algorithms rather than only being very good with a programming language or a specific framework without understanding the basics. In other words, know the basics and be a polyglot.
Has good communication skills - both verbal and written. Without this, it's impossible to be a good software engineer.
Being pragmatic - Works incrementally and balances delivering value frequently with delivering high quality.
Iterates fast - Values Continuous Integration (CI) and Continuous Delivery (CD), makes their code fail fast, enforces consistency and keeps the master branch releasable. Your release process should be as easy as adding a git tag that is a valid semantic version.
Cares for sustainability - Strives for producing code which will last for years, even decades. Not one-off, works-now-who-knows-when-it-will-stop-working code.
Knows the business - Cares to understand the business domain and strives for establishing a ubiquitous language between the software product team and stakeholders.
Strives for THE BEST UX - Makes user experience a part of product completeness.
Being a team player - Works with their peers, gets/gives code review from/to them. Develops their peers' skills while developing their own. Should strive for being transparent to the team all the time.
Knows the metrics but also has a vision - Should know the metrics and how to get them to make decisions. However, they should have a vision at the same time, too. They should not have the "Let’s ask the users" mindset as the default approach for product feature decisions. Remember, good artists copy; great artists steal! The problem you are trying to solve has probably been solved within the same or a different context. Find that, bend it and apply it differently.
Disagrees and commits when needed - Should not be shy about getting their opinions out and pursuing them. However, they should also know that a decision has to be made, and when that’s the case, they should commit fully and try to get the best out of it even if it’s not the decision they wanted to see.
Values open source and contributing back to the software community - Has a blog, gives talks at conferences or user groups, contributes back to open source projects. Simply shares what they are proud of with others openly.
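
To make the "Iterates fast" point above a bit more concrete, here is a minimal sketch of how a build script might decide whether a git tag should trigger a release. The function name `is_release_tag` and the optional `v` prefix are my own assumptions for illustration, and the pattern is a simplified take on the semantic versioning format rather than a full implementation of the specification.

```python
import re

# Simplified semantic version pattern: MAJOR.MINOR.PATCH with optional
# pre-release ("-rc.1") and build metadata ("+build.5") parts.
# The optional leading "v" is an assumption about the tagging convention.
SEMVER_TAG = re.compile(
    r"v?(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)"
    r"(?:-[0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*)?"
    r"(?:\+[0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*)?"
)


def is_release_tag(tag: str) -> bool:
    """Return True when the whole tag looks like a semantic version."""
    return SEMVER_TAG.fullmatch(tag) is not None


if __name__ == "__main__":
    # Hypothetical tag names, only to show how the check behaves.
    for tag in ["1.4.0", "v2.0.0-rc.1", "1.4", "release-2021"]:
        print(tag, is_release_tag(tag))
```

A CI pipeline could run a check like this against the tag name it receives and only publish when it passes, so that cutting a release stays a one-step action.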

There are probably more, but these are the most important ones that I care about and value at a very high level. However, I wonder what yours are, too. Therefore, please share them with me here by sending me a comment.