Hardly a month goes by these days without some exciting new technology hitting the blogosphere, filling the imaginations of CTOs everywhere. At OmniTI, we are often approached by people asking about the "razor's edge" technology of the week. Frequently, they are convinced that this is the technology they need for their business, and will often try to shoehorn their requirements to fit the new toy. We typically have to convince people that their dusty old relational database actually handles their data needs just fine, even if it isn't web scale. Tried and true typically works better than shiny and new.
Sometimes, however, a client's requirements really do lend themselves nicely to the newer technologies, and we are justified in playing with them during business hours instead of at home! We love our jobs at OmniTI.
The request we'll review here was fairly simple. The client needed a fast, highly scalable web service to provide geolocation data based upon the IP address of the requester. It also had to serve a small static file. The service would run for a few months and then be discontinued. It didn't require true high availability, but we had to be able to fix it quickly if something went wrong.
Using technologies we were already employing on the project, we wrote a simple Mungo-based Perl script to look up the information in the MaxMind city-level database and return the data inside a JSON object. Once placed on the existing Apache httpd servers along with the static document, we had a working prototype for the client's third party to develop against while we looked at the more complex issues in the request.
In this case, there were two immediate concerns:
- The service had to be fast and handle a lot of requests.
- This component should not endanger the availability of the rest of the web services.
The web farm already deployed to handle the client's business used the Apache httpd server and, leveraging that platform's flexibility, had grown to support a number of legacy web services. As this setup was already tuned for those particular needs, we didn't really want to reconfigure it. However, we needed to know where we stood from a performance point of view, to find out how much traffic we could handle. A quick ApacheBench test revealed:
Document Length: 188 bytes
Requests per second: 1306.85 [#/sec] (mean)
Time per request: 38.260 [ms] (mean)
Time per request: 0.765 [ms] (mean, across all concurrent requests)
Transfer rate: 520.79 [Kbytes/sec] received
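For reference, output of that shape comes from a plain ApacheBench run along these lines; the host and path below are placeholders and the total request count is arbitrary, while a concurrency of 50 is suggested by the ratio of the two mean per-request times above.

    ab -n 10000 -c 50 http://webfarm.example.com/get_city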
Our goal was to safely handle about 5,000 requests per second. While we could sustain that traffic by scaling out across our client's multiple web servers, if traffic reached our worst-case expectations there would be an unsafe likelihood of the servers becoming saturated, followed by service degradation. Or, worse yet, all of the web services could become completely unavailable. Needless to say, either case would be unacceptable. We had to isolate this service from the rest, but isolating it on hardware similar to what we were already using for the web farm would have been prohibitively expensive, especially considering the transient nature of a project designed to last only a few months.
With such requirements, a cloud deployment was the obvious choice. While there are plenty of reasons to stay away from the cloud, there are some really good reasons to use it as well. The cloud would let us use exactly as much CPU and bandwidth as we needed, and provide an easy and quick way to get more if we required it. Our service did not store persistent data, even at a session level, so if a cloud instance went "poof," there was nothing we couldn't afford to lose. When it was no longer needed, we could just shut down or scale back the servers without worrying about excess hardware, which is exactly the benefit cloud proponents always tout. EC2, here we come!
With the move to EC2, we had the option of deploying the prototype code we had already written. However, as that code leveraged an existing ecosystem designed to service a much wider spectrum of needs, duplicating the environment would have been overkill, and attempting to strip it down to the minimum necessary would have been a rather daunting challenge with little long-term benefit. With the luxury of exploring a greenfield approach, we turned our attention to Node.js. At OmniTI, we had the advantage of having already seen Node.js used a few times for production services, and we had even incorporated it into a few solutions we had developed, so we knew that the kind of lightweight, fast-response code we were looking to develop for this project was very well suited to Node.js. Through a bit of serendipity, Theo Schlossnagle had just recently branched, and then finished, a new version of node-geoip that was capable of reading the MaxMind City database. Add to that my personal joy at getting the chance to use Node.js in production for a customer project, and the decision was clear.
Plan in hand, the Perl script was quickly converted to Node.js and placed on a Small Amazon EC2 instance for load testing (thanks to Zach Malone for assistance with all of the cloud benchmarking work). The entire code follows.
    var http = require('http'),
        sys = require('sys'),
        geoip = require('geoip');

    // Open the MaxMind city-level database once at startup.
    var con = new geoip.Connection('/www/geodata/GeoIPCity.dat', 0, function(){});

    http.createServer(function (req, res) {
      if( req.url == '/get_city' ) {
        res.writeHead(200, {'Content-Type': 'text/plain'});
        // Prefer the X-Forwarded-For header (set by a load balancer),
        // falling back to the socket's remote address.
        var ip = req.headers['x-forwarded-for'] ||
                 req.connection.remoteAddress;
        con.query( ip, function(result) {
          var obj = new Object();
          if(!result){ obj.city = 'Unknown'; }
          else { obj.city = result.city; }
          res.end(JSON.stringify(obj) + "\n");
        });
      } else if( req.url == '/crossdomain.xml' ) {
        res.writeHead(200, {'Content-Type': 'text/xml'});
        // Serve a permissive Flash cross-domain policy.
        res.end('<?xml version="1.0"?>\n' +
                '<cross-domain-policy>\n' +
                '  <allow-access-from domain="*" />\n' +
                '</cross-domain-policy>\n');
      } else {
        res.writeHead(404, {'Content-Type': 'text/plain'});
        res.end("File not found.\n");
      }
    }).listen(80);
Slightly more than twenty lines of code. This will return a JSON object with the name of the nearest city, based upon the client's IP address, or "Unknown" if the address does not resolve. It will also serve a crossdomain.xml file to any Flash objects that need one, and return a 404 for any other request. Where's the web server, one may ask? Node.js takes care of all of that for you.
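To give a feel for the interface, hitting the service looks something like this; the hostname and the returned city are purely illustrative.

    $ curl http://geoip.example.com/get_city
    {"city":"Baltimore"}

    $ curl http://geoip.example.com/some_other_path
    File not found.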
Simple ApacheBench testing of this code gave between 300 and 600 requests per second on a Small EC2 instance with a single virtual CPU:
Document Length: 223 bytes
Requests per second: 344.75 [#/sec] (mean)
Time per request: 145.033 [ms] (mean)
Time per request: 2.901 [ms] (mean, across all concurrent requests)
Transfer rate: 96.62 [Kbytes/sec] received
Yes, this is much slower than what we benchmarked on our web farm, but it was much cheaper to scale out, not to mention that it gave us the service separation we wanted. To scale out, we had to load balance our instances; we used Amazon's Elastic Load Balancing here.
We expected a small decline in performance due to the load balancer's overhead, but we were pleasantly surprised to see slightly BETTER numbers. Apparently, getting the IP address in Node.js from the request header is faster than getting it from the connection object, so having a load balancer in the middle actually improved performance:
Document Length: 223 bytes
Requests per second: 367.87 [#/sec] (mean)
Time per request: 135.918 [ms] (mean)
Time per request: 2.718 [ms] (mean, across all concurrent requests)
Transfer rate: 112.40 [Kbytes/sec] received
Repeating the test with two, and then three, server instances behind the load balancer, each new instance continued to raise the volume of requests per second we could handle by another ~350-600. So, with only three Small EC2 instances, we were able to crank out between 1,200 and 1,500 requests/sec of GeoIP lookups.
350 to 600 requests/second is a pretty large window, and it means that some of our EC2 instances do much more work than others. This is something you have to deal with when deploying a cloud-based solution. Thankfully, EC2 gives you a lot of flexibility to rapidly create and destroy instances, so if you get an especially slow one, it can be worth throwing it away and creating a new one. As a bonus, if needed, it takes fewer than 15 minutes to manually provision, set up, and start a new instance, without using Amazon EBS. Not relying on EBS enabled us to dodge the infamous EC2 outage: our service was unaffected despite running in the unfortunate Virginia data center.
Now, just because we were using Node.js and deploying to the cloud didn't mean we tossed due diligence aside. To make certain that the EC2 solution offered good performance for the money, we benchmarked the same code on a Joyent SmartMachine that we had available. A single Joyent system had the performance of roughly 3.5 Small EC2 instances:
Document Length: 223 bytes
Requests per second: 1564.30 [#/sec] (mean)
Time per request: 31.963 [ms] (mean)
Time per request: 0.639 [ms] (mean, across all concurrent requests)
Transfer rate: 438.43 [Kbytes/sec] received
The Joyent system, however, cost about twice as much as three Small EC2 instances plus an Elastic Load Balancer. Joyent includes a generous amount of bandwidth with any instance (Amazon does not), but its larger, fixed monthly cost meant that we would not have as much flexibility to scale up and down as we did with EC2's hourly billing.
So, we had a working solution at this point, but we still had to make sure it would continue to work; in short, it had to be monitored. Normal end-to-end monitors and request-timing monitors were put in place on the load balancer, as well as checks on each individual server instance. But we also wanted to know how much traffic we were serving, without any more fuss. Node.js could keep track of that for us as well. By simply adding:
    var cities = 0, xmls = 0, fnf = 0, status = 0;

...some variable++'s in the appropriate spots, and...

    } else if( req.url == '/status' ) {
      res.writeHead(200, {'Content-Type': 'text/plain'});
      status++;
      var obj = new Object();
      obj.cities = cities;
      obj.xmls = xmls;
      obj.fnf = fnf;
      obj.status = status;
      res.end(JSON.stringify(obj) + "\n");
...we could see exactly how much traffic of each type each Node.js instance had served, along with whether any of them had crashed (as evidenced by a reset counter). This was set up to be pulled by Circonus, which can consume the JSON data and graph the usage trends over time.
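For the curious, those increments amount to nothing more than bumping the matching counter at the top of each branch of the request handler. The placement below is a sketch rather than the exact production diff:

    if( req.url == '/get_city' ) {
      cities++;   // count GeoIP lookups served
      // ... existing /get_city handling ...
    } else if( req.url == '/crossdomain.xml' ) {
      xmls++;     // count crossdomain.xml requests
      // ... existing crossdomain.xml handling ...
    } else if( req.url == '/status' ) {
      // status++ already happens in the handler shown above
      // ... status handling ...
    } else {
      fnf++;      // count 404s returned
      // ... existing 404 handling ...
    }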
Perhaps also worth noting: all of this was done almost a year ago. "A few months" turned into much longer. The client's required utilization has gone up and down, with a corresponding number of EC2 nodes added or removed, but this simple script hasn't had to be modified or touched since. It has happily run a production service without any problems, for a minimal amount of time invested.
To be fair, this was a rather simple problem that could have been solved in a number of different ways, perhaps even more effectively. But sometimes it behooves you to explore those sexy new technologies and learn their trade-offs firsthand. That way, you understand what you're getting into, and you can feel comfortable deploying them for critical components of an architecture. While it's essential to remember that sexy doesn't mean good, it's a pleasant reminder that sometimes good can be sexy.