
Cloud Gates: March 2013 service upgrades

Sergey Schetinin — Wed, 03 Apr 2013 22:46:00 +0000

New deployment system

Last week we put a new deployment system into production. Deployment is basically a way of updating the software that runs the service, so for a product under active development it's a very important component. For CloudGates it's even more crucial, as you'll see later, but first I want to mention why this would matter to you.

Our previous system was pretty sweet as well, but when updating the software on the nodes, the FTP server had to be restarted, meaning the existing connections had to be interrupted. It wasn't an issue when we were just starting, but at this point there's no moment in time when we can do this and not interrupt a number of transfers. This obviously makes us apprehensive about deployments, because if we do them too often, the service will appear to just randomly drop connections.

If we were to find a way to deploy updates without interrupting existing connections, we would be free to deploy updates at any rate we want. And we did exactly that.

Now, even when we upgrade the server, all existing connections and transfers just keep on going. The older connections are handled by the older version of the code, the one that was the newest available at the moment of connection, while new connections are handled by the new code. We see a lot of long transfers, sometimes taking days, so in that timeframe we might do a number of deployments and end up with several different versions of the server running at once, each for as long as anyone is using it. Once all connections to an old server are done it exits, but not before that.
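A minimal sketch of the draining idea described above, not the actual CloudGates code. It models each running server version as a "generation" that stops accepting new connections after a deployment and exits only once its last connection is gone. The names `ServerGeneration`, `Node`, `deploy` and `reap` are hypothetical, and real process and socket handling is left out.

```python
from dataclasses import dataclass, field


@dataclass
class ServerGeneration:
    """One running version of the server (hypothetical model)."""
    version: str
    accepting: bool = True  # only the newest generation accepts connections
    connections: set = field(default_factory=set)

    def can_exit(self) -> bool:
        # An old generation exits only after its last connection is done.
        return not self.accepting and not self.connections


class Node:
    def __init__(self, version: str):
        self.generations = [ServerGeneration(version)]

    def deploy(self, version: str) -> None:
        # Stop routing new connections to older code, but leave it running.
        for gen in self.generations:
            gen.accepting = False
        self.generations.append(ServerGeneration(version))
        self.reap()

    def connect(self, conn_id: str) -> str:
        # New connections always land on the newest generation.
        newest = self.generations[-1]
        newest.connections.add(conn_id)
        return newest.version

    def disconnect(self, conn_id: str) -> None:
        for gen in self.generations:
            gen.connections.discard(conn_id)
        self.reap()

    def reap(self) -> None:
        # Drop generations that have drained completely.
        self.generations = [g for g in self.generations if not g.can_exit()]

    def running_versions(self) -> list:
        return [g.version for g in self.generations]
```

In this model a transfer that started on "v1" keeps "v1" alive through any number of later deployments, which is also why many long transfers can leave many versions running side by side.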

This soft upgrade process was something we intended to implement later, but with a number of rather big features ready for deployment it was clear that those features would need to be tweaked once deployed, and that we needed a way to do deployments freely. So even though a lot of features were ready to be released in March, they were put on hold until this deployment system was in production.

New monitoring and failover system

There was a small hiccup on one of the nodes yesterday: because of the large number of server versions running, it ran out of file descriptors. This is already fixed, but the important thing is that this issue had very little impact on the service. The reason the impact was small is that our new monitoring system took that node out of the pool as soon as it detected that the node was having problems.

This monitoring system is also something we implemented and put into production this month. We use seven nodes in a number of locations around the world to handle customer connections locally, and the monitoring system periodically tries to connect and log into each node. When it has a problem connecting, or thinks that things aren't going the way it expects, it takes that node out of rotation, and in about half a minute the node is completely removed from the pool (the delay is caused by the DNS TTL).

So when one of the nodes was having problems earlier, the monitoring system removed it from the pool and the Cloud Gates service itself kept working as if nothing happened.
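The health-check loop can be sketched roughly as follows. This is an illustration under assumptions, not the real monitoring system: `tcp_probe` only checks for an FTP `220` greeting instead of performing a full login, and `healthy_pool` simply returns the addresses that would stay in DNS. Both function names and the example addresses are hypothetical.

```python
import socket
from typing import Callable, Dict, List


def tcp_probe(host: str, port: int = 21, timeout: float = 5.0) -> bool:
    """Return True if the node accepts a TCP connection and greets with FTP code 220."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            greeting = sock.recv(256)
            return greeting.startswith(b"220")
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def healthy_pool(nodes: Dict[str, str], probe: Callable[[str], bool]) -> List[str]:
    """Return the addresses of nodes that pass the probe, ordered by node name."""
    return [addr for _name, addr in sorted(nodes.items()) if probe(addr)]


# Example with a stand-in probe, so nothing is actually contacted.
nodes = {"eu-1": "192.0.2.10", "us-1": "192.0.2.20"}
pool = healthy_pool(nodes, lambda addr: addr != "192.0.2.20")
```

Publishing only the addresses in `pool` with a short DNS TTL is what bounds how long clients keep reaching a node that failed its check.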

New error reporting

Yet another update put into production recently is centralized error reporting for the FTP servers. When anything unexpected happens on the server the error is logged and we are notified of it. FTP servers generate a lot of logs, so having the errors highlighted is crucial. We can then go in and deploy a fix to whatever was wrong, which is usually some rare corner case we would not have encountered in synthetic testing.
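One way to separate errors from routine log volume is a dedicated log handler that only sees records at error level. The sketch below uses Python's standard `logging` module. The `ErrorCollector` class and the logger name `ftp.node` are made up for illustration, and a real setup would forward the reports to a central service rather than keep them in a list.

```python
import logging


class ErrorCollector(logging.Handler):
    """Collects ERROR-and-above records; a stand-in for a central reporting service."""

    def __init__(self):
        super().__init__(level=logging.ERROR)
        self.reports = []

    def emit(self, record: logging.LogRecord) -> None:
        self.reports.append((record.name, record.levelname, record.getMessage()))


log = logging.getLogger("ftp.node")
log.setLevel(logging.DEBUG)
log.propagate = False  # keep the example quiet on the console
collector = ErrorCollector()
log.addHandler(collector)

log.info("150 Opening data connection")            # routine, not reported
log.error("unexpected reply from storage backend")  # reported
```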

Upcoming updates

You might have noticed a common theme for the March updates -- all of them are designed to make feature deployments safer. When we deploy new code we know that if it's problematic, the monitoring system will minimize the negative impact. We know that the error reporting will let us know what exactly causes the issue. And the deployment system will let us deploy an update in minutes without causing any disruption.

Of course we don't expect the new features to cause problems, but it's something you definitely want to be prepared for. And of course the specifics of our service make the new deployment system practically required. Without it, adding new features always had to be weighed against the need to restart the server; now we can finally start deploying freely.

We already have a number of features developed that were waiting for these updates before rollout and now we will start happily putting them into production. Expect quite a few new features this month!

Cloud Gates: Technology behind CloudGates [part 3]: Servers around the world appearing as one

Sergey Schetinin — Thu, 28 Mar 2013 00:34:00 +0000

Keeping data in sync

The customer might be using the same gate in Europe via one of our European nodes and in the US at the same time, using one of the nodes on our US network. We need to make sure the customer cannot tell that two separate servers are involved, and we need to do it without adding overhead.

As explained in an earlier blog post we don't use caching on the nodes, so that's one potential problem solved already -- the customer sees all of the same data in both locations without any delays. Another question is how do we handle the initial login?

Mapping credentials to an S3 bucket

Having our own implementation of the protocols comes to the rescue again! If we were using some other FTP server software we would need to propagate gate management actions from the UI to all of our nodes. That is a harder problem than it might seem at first glance -- any number of nodes might be down for maintenance, new nodes need to come online fully synced, etc. We avoid this issue completely by being smart about the login process.

When a user signs into the gate, their FTP (or SFTP or WebDAV) client software sends the credentials to the server. Our server in turn forwards this login request to a login server over an encrypted connection. The login server then checks the database and responds to the server with the gate details.

If the credentials were valid, the server can now work directly with the client and will not need to talk to the login server until the next login. If the login credentials were wrong, the server obviously rejects the connection.

By splitting the login server into a separate entity we radically simplify our management and make the gateway nodes themselves practically stateless. And having stateless nodes is the holy grail of scalability.
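The login flow above can be modelled in a few lines. This is a sketch under assumptions, not the production design: `LoginServer`, `GatewayNode` and `GateDetails` are hypothetical names, the "database" is an in-memory dict with plain-text passwords (a real service would store hashes), and the encrypted hop between node and login server is reduced to a method call.

```python
import hmac
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class GateDetails:
    bucket: str
    prefix: str


class LoginServer:
    """Hypothetical central login service backed by an in-memory 'database'."""

    def __init__(self, gates: Dict[str, Tuple[str, GateDetails]]):
        self._gates = gates  # username -> (password, gate details)

    def login(self, username: str, password: str) -> Optional[GateDetails]:
        entry = self._gates.get(username)
        if entry is None:
            return None
        stored_password, details = entry
        # Constant-time comparison avoids leaking information through timing.
        if hmac.compare_digest(stored_password.encode(), password.encode()):
            return details
        return None


class GatewayNode:
    """Keeps no user database of its own; every login is forwarded."""

    def __init__(self, login_server: LoginServer):
        self._login_server = login_server

    def handle_login(self, username: str, password: str) -> Optional[GateDetails]:
        # In a real deployment this call would travel over an encrypted connection.
        return self._login_server.login(username, password)


login_server = LoginServer({"alice": ("s3cret", GateDetails("example-bucket", "uploads/"))})
eu_node, us_node = GatewayNode(login_server), GatewayNode(login_server)
```

Because neither node stores anything about the gate, `eu_node` and `us_node` give the same answer for the same credentials, and a new node needs no syncing before it can serve logins.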

Cloud Gates: Technology behind CloudGates [part 2]: Scaling

Sergey Schetinin — Thu, 14 Mar 2013 16:21:00 +0000

One of the primary advantages of using CloudGates with S3 as your FTP server is that you will never run out of disk space. And a lot of our customers are indeed making use of this, which means we need to handle terabytes of data coming through the servers.

How many servers do we need?

If we were using just one big server that would work for a while, but at some point no server would be capable of handling all the transfers we see. Having just one server would also leave us open to network outages, high-usage customers would severely impact the bandwidth available to each other, and the server would have to sit in one specific datacenter, making it slow for the rest of the world.

It was clear to us from the beginning that we would need multiple nodes around the world for our service to meet the quality standards we set for ourselves. The simplest way to do this would have been to assign users to a specific node and create the FTP user on that node only. This would save us some work, but it's not future-proof -- eventually nodes would need to be decommissioned or upgraded and the users would need to be migrated. Assigning users to nodes would have been a recipe for creating uneven load across servers and exposing us to downtime.

Making the service scale

To make things transparent for the user and to allow us to scale the service, all of our gateway nodes are interchangeable. No matter which node the customer connects to, it will accept the same credentials and provide the same service.

Having such flexibility lets us deploy as many nodes as we need in any number of Points of Presence around the world. Because every node is on equal footing, adding capacity is simply a matter of adding nodes.

Cloud Gates: SFTP / SSH support is in production

Sergey Schetinin — Thu, 07 Mar 2013 13:19:00 +0000

Our gates into S3 and Glacier had SFTP support in beta for a while but it's finally running in production. All of the existing and new gates have SFTP enabled and your connection details are on the gate info page.

SFTP sounds similar to FTP but is actually a completely different protocol. It relies on SSH for transport (which means it's encrypted) and supports multiple concurrent transfers over the same connection.

If your client supports it, it's generally preferred over FTP, but the protocol is more complex than FTP and some clients have a suboptimal implementation, which means it could be slower than FTP in those cases.

We use a 100% in-house implementation of the protocol, which lets us make SFTP gates to S3 as efficient and fast as we can. Our implementation supports a set of secure ciphers that covers all of the clients we have come across (for example, some WinSCP versions need aes128-cbc and Postini needs blowfish-cbc).
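Why the server's cipher list matters for client compatibility: in SSH key exchange, the cipher used is the first one in the client's preference list that the server also supports. The sketch below shows just that selection rule. The `SERVER_CIPHERS` list is an illustrative assumption (only the two CBC ciphers are mentioned in this post), written with the cipher names as registered for SSH.

```python
from typing import Optional, Sequence


def negotiate_cipher(client: Sequence[str], server: Sequence[str]) -> Optional[str]:
    """Pick the first cipher in the client's list that the server also supports."""
    supported = set(server)
    for name in client:
        if name in supported:
            return name
    return None  # no common cipher: the connection cannot be established


# Hypothetical server configuration, for illustration only.
SERVER_CIPHERS = ["aes128-ctr", "aes256-ctr", "aes128-cbc", "blowfish-cbc"]

chosen = negotiate_cipher(["blowfish-cbc"], SERVER_CIPHERS)
```

A client that offers only one older cipher can connect only if the server lists that exact cipher, which is why covering the ciphers real clients need is part of supporting them.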

SSH also has support for public-key authentication (logging in with a private key instead of a password), but we didn't implement it in the gates at this point. Let us know if this is something you'd want to see on our service.

Cloud Gates: Technology behind CloudGates [part 1]: Providing FTP servers

Sergey Schetinin — Thu, 21 Feb 2013 14:08:00 +0000

CloudGates is a cloud-hosted service where you can create an FTP server in literally seconds. That server is backed by your Amazon S3 or Glacier account and is private. Sounds simple enough, but there's a lot more to it than meets the eye.

We make it look as simple as possible to the customer, but to make it happen there's a neat technical solution working behind the scenes. A naive implementation would have used an existing FTP server implementation and would try to emulate a filesystem underneath it, but that is not a robust solution and what we do is different.

Benefits of the custom implementation

The CloudGates servers are custom implementations of the FTP, SFTP and WebDAV protocols that translate protocol commands directly into operations on the underlying storage.
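To make "translating protocol commands directly into storage operations" concrete, here is one plausible mapping from a few FTP commands to S3 API operations. It is an illustrative assumption, not the actual CloudGates mapping, and it ignores details such as multipart uploads for large files. S3 has no rename operation, so a rename is shown as a copy followed by a delete.

```python
# FTP command verb -> S3 API operations it could translate to (illustrative).
FTP_TO_S3 = {
    "STOR": ["PutObject"],
    "RETR": ["GetObject"],
    "DELE": ["DeleteObject"],
    "LIST": ["ListObjects"],
    "SIZE": ["HeadObject"],
    "RNTO": ["CopyObject", "DeleteObject"],  # S3 has no native rename
}


def s3_operations(ftp_command: str) -> list:
    """Return the S3 operations for one FTP command line, or [] if none apply."""
    verb = ftp_command.strip().partition(" ")[0].upper()
    return FTP_TO_S3.get(verb, [])
```

For example, `s3_operations("STOR report.csv")` gives `["PutObject"]`, while a command with no storage side effect, such as `NOOP`, gives an empty list.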

There's no caching happening on our servers, and that's a good thing: each upload, rename and every other operation is carried out directly on the S3 storage itself. This way the success or failure reported to the FTP client is always true to what happened. You will never see any data out of sync and will never need to wait until it appears on S3. There's no possibility of data loss in transit when a cache overflows and no security issues with data hanging around in caches on the server. We don't use a cache, so all of these problems are nonexistent.

Supporting huge transfers

This also allows us to support arbitrarily large uploads (for now we enforce a limit of 32 GB for a single upload, but we will be removing the limit altogether). The uploads do not hit the disk or a buffer on the server; instead we create a multipart upload and upload chunks of the file as we receive them. Resuming uploads is implemented in a similar fashion.
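The chunking step can be sketched as a generator that cuts an incoming stream into numbered parts. This is an assumption-laden illustration rather than the real server code: `iter_parts` is a hypothetical helper, it holds one part in memory at a time, and it assumes a buffered stream whose `read(n)` returns `n` bytes until the end of the data. The 5 MiB default reflects S3's minimum size for every part except the last. Each yielded part would then be sent with an S3 UploadPart request.

```python
import io
from typing import BinaryIO, Iterator, Tuple

MIN_PART_SIZE = 5 * 1024 * 1024  # S3 minimum for every part except the last


def iter_parts(stream: BinaryIO, part_size: int = MIN_PART_SIZE) -> Iterator[Tuple[int, bytes]]:
    """Yield (part_number, data) pairs; part numbers start at 1, as in S3."""
    part_number = 1
    while True:
        data = stream.read(part_size)
        if not data:
            break  # end of the incoming stream
        yield part_number, data
        part_number += 1


# Tiny part size so the example is easy to follow.
parts = list(iter_parts(io.BytesIO(b"abcdefghij"), part_size=4))
```

Since the total size never has to be known up front, the same loop works for a file of any length, which is what makes streaming uploads over FTP possible on top of multipart uploads.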

We even upload the chunks concurrently whenever possible. So even though the S3 API itself does not present any simple ways of doing streaming uploads, with Cloud Gates you get them for free. We even observed that our concurrent code can often make the uploads through our service faster than uploads directly to S3 from dedicated S3 client software!
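Uploading parts concurrently can be sketched with a thread pool. In this hypothetical example `upload_part` stands in for whatever sends one part to S3 and returns its ETag, and the worker count is arbitrary. The results are sorted because completing an S3 multipart upload requires the part list in ascending part-number order.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Tuple


def upload_concurrently(
    parts: Iterable[Tuple[int, bytes]],
    upload_part: Callable[[int, bytes], str],
    max_workers: int = 4,
) -> List[Tuple[int, str]]:
    """Upload parts in parallel; return (part_number, etag) sorted by part number."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Note: this submits every part up front; a streaming server would
        # cap the number of parts in flight to bound memory use.
        futures = [(number, pool.submit(upload_part, number, data)) for number, data in parts]
        results = [(number, future.result()) for number, future in futures]
    return sorted(results)


# Stand-in uploader that fabricates an ETag instead of calling S3.
result = upload_concurrently(
    [(2, b"bb"), (1, b"a")],
    lambda number, data: f"etag-{number}-{len(data)}",
)
```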

This is yet another area where our own implementation pays off big time. Having full control over both the FTP protocol and the S3 API communication allows the gateway to do these operations efficiently even if you are using any old FTP client. A less committed approach would not make this possible and would result in a severely limited service. An FTP gateway that only supports bulk uploads is useful, but we opted to implement every command there is and make all of them robust, and it worked out really well.