Why NoSQL Does Not Mean NoDBA
By: Robert Treat 19 Jan '12
Whether you like it or not, NoSQL is changing the world. Granted, it's not even clear what NoSQL means sometimes, but there is no doubt that, for better or worse, we are in a renaissance of non-relational database systems right now. For me personally, I tend to ignore the hype, study these systems with a critical eye, and then deploy them where traditional RDBMS software struggles. I do occasionally bump into people who babble on about how NoSQL will put DBAs out of business. When I hear this kind of comment, I just nod my head and smile: It's hard to convince people that their beloved paradigm shift is just more of the same, and also very seldom worth the effort. However, I was recently talking to an Oracle DBA, and he made some comments about how he was concerned that new companies would have no use for a DBA because they were all switching to NoSQL. This surprised me a little, actually. I figured if there was one group of people who wouldn't buy into the NoSQL hype, it would be the stalwart Oracle crowd. Et tu, Brute? If the hype has gotten to them, I guess it's time for me to speak up. Whatever you think about NoSQL, the death of the DBA is a ludicrous idea.
One of the little known secrets of NoSQL systems is that they are used to hold data. Most NoSQL systems try to trumpet the ease of pushing data in and out of the system: "just push your JSON object to the server and your data is instantly stored, regardless of structure." It's Oh, So Magical. The problem, of course, is that easily dumping data into a system doesn't mean much if you can't get it back out. This is where a data model comes into play.
In a traditional RDBMS environment there is usually someone who designs a data model ahead of time, breaking down information into a relational design, sometimes even drawing up an ERD diagram for people to reference, so they can find their information. This would be turned into DDL, committed to the database and then enforced rather strictly. Try to insert the wrong type of data, and you'll get an error. Try to query for a column that doesn't exist, and you'll get an error. I've seen many a developer complain that the database is "too strict" because they couldn't get their queries right, but the truth is you still need a data model even though you went to a NoSQL system. You won't hear much about these problems from NoSQL advocates because the type of shops that use them are often smaller shops, where there are a small number of developers, and everyone knows what everyone else is doing, more or less. Or they are using NoSQL on new projects--where the scope of the system still fits in head-space pretty easily. However, as code bases grow, and projects stretch on for years, and developers come and go, not knowing what data is stored where means that you start to wish you had a data model. You recognize this the first time you hit a bug because you assumed that not getting a value back in the "items" key meant the user had no items in their cart, only to realize later that you should have been looking at the "item" key. If you're lucky, this type of bug pops up on information retrieval rather than in information storage; but either way, cleaning up such a bug can be painful. You'll also notice how, often, you have to pull back objects just to see what information is actually stored, because depending upon when the object was inserted, it may have a different opinion on what the data it should be holding looks like. Of course, you can go back and dig through the code that puts the data in, but this assumes you know where that code lives. Maybe you could ask the guy who originally wrote that feature, but suppose he left the company six months ago? Oh well, grep is your friend.
So let's pretend you've taken the time to discuss what data you must hold up front, and someone keeps a diagram of that posted on the wall. . . awesome. Sadly, you must still implement it in a physical system. In all the hype about "schema free" databases, the fact that in order to get good performance you still have to make adjustments for physical layout, or build things like indexes into your system to make queries against the server fast enough, is often overlooked. Yes, even on NoSQL you still need to know about explain tools and indexing.
Think about what this means: Someone has to recognize that a certain piece of data is going to be requested a lot or notice a performance problem on the existing server. Once you figure out the right index to build, someone has to build it in production. This means locking, potentially, and certainly means an IO hit. Is your application developer going to be responsible for this? Does he know if your index build will require backfilling, like when you build a secondary index on existing data in Riak Don't get me wrong, it's not that they aren't capable of doing this work, it's just important to realize that this type of work must be done.
OK, you have a data model, and are managing the physical implementation; that's good. But did you know that your NoSQL system still must interact with the disks on your server? It does. The better question is, do you know *how* it interacts with the disks? Actually, before we talk about the disks, do you know how it interacts with RAM? Some NoSQL systems absolutely fall over when they hit thresholds larger than RAM. For some, it's total data set size, for others it might be the size of all index pointers in the system. Of course, maybe you have a system that doesn't fall over, it just becomes slower, perhaps unacceptably slower. In either case, you need to be aware of these limitations. Have monitors in place for them, and then perform capacity planning accordingly. Now, let's get back to disks. How crash-safe is your NoSQL server? Does it give you single node durability? Are writes automatically synced to disk with each put, or are they batched up and pushed out occasionally? Maybe it's configurable; do you know how your systems are set up? If you were using Postgres, you could tune the durability guarantees for all of these cases. Your DBA knows this, and whoever is in charge of your NoSQL system needs to know this. Even if you think you are storing data you can afford to lose, chances are your business model must be aware of just how much exposure it has. Oh, but yours is a start-up, so you don't have business concerns yet. . .? Still, the level of durability is going to have significant impact on your IO needs, and that, in turn, will impact your performance--and you can't post your devops data-porn if you can't get decent systems performance.
Of course, disks are kind of unimportant these days, given that everyone runs multiple nodes, and you can have a distributed hash table running across multiple nodes with just a handful of Chef commands. That said, have you ever managed a complex distributed system? You know who probably has? Your DBA. By far the most common answer to the failover problem is to stick up a replicated database slave. It's also common to see people putting up slaves for horizontal read scaling. DBAs understand the tradeoffs in consistency guarantees that come with these types of systems--not just at the node level, but from the applications point of view as well. You'll need solid understanding of this on your dev team if you are going to build apps against a distributed data system. In addition, someone has to manage all of these servers and make sure they perform well. If your NoSQL system uses master-slave replication, someone with experience in this area might be handy. If you've ever built a Master-Master pair with individual Slave systems, you probably know what I am talking about. Oh, do you think running a clustered hash table system is easier? Just because you can add a new node to the ring doesn't mean it's free. You need both server level and cluster level monitoring in place. You need to make sure you can afford the IO and network strains as data is copied around, and you need to know under what circumstances locking will be involved. These things really do happen.
I remember when the MySQL documentation had a section devoted to explaining why foreign keys weren't needed. Of course, once MySQL finally implemented foreign keys, it became a major headline for their release announcements. This is what happens as systems mature. Most NoSQL systems can cut down on overhead by eliminating (or more accurately, not implementing) many of the features people have come to expect from an RDBMS. Of course, which features are eliminated differs across systems.
To wrap your head around this, you need to have a good understanding of how triggers work, how asynchronous calls affect transaction semantics (or the lack thereof), and what types of work you might want to do on the server side. Some triggers are used to enforce data integrity or do data manipulation at the server level; these are the types I think work fine within a vertically scaled system. Others really are extensions of the application, and while they are sometimes frowned upon for adding overhead into a centralized resource like a typical RDBMS, in a decentralized system that scales out the arguments against them aren't as clear cut. One thing I do know though: this is probably not something your SA wants to be involved with at all.
If all of this isn't enough to make you think twice, let me mention one more thing. While you may not have a query language in your NoSQL system, that doesn't mean you don't query against it. Whether you are writing distributed map-reduce queries, trying to balance link-walking vs. secondary indexes, or trying to figure out whether the code you've written to pull back every key in the system is going to be a problem; there are going to be times when you will have to make these queries more efficient. This is probably going to be a more application-centric type of tuning than the traditional RDBMS, but watch as someone in your dev team becomes known as "the go to guy" for making your map/reduce query run more efficiently. And incidentally, you should also be aware that many of today's NoSQL systems are trying to bolt on SQL and SQL-like interfaces into their systems. Who fails to get excited when thinking about rewriting queries with subselects into joins clauses against a Hadoop cluster?
If you think that managing all of this sounds like an impossible task, you're welcome. This is the job that DBAs have been doing for years. . .and yes, it can be incredibly challenging. Of course, it doesn't have to work this way. You can draw the lines of responsibility differently right now. Make the application developers manage the data model, design the schema, and tune the queries. Let your ops people be in charge of building new nodes, managing replication, and ensuring you have valid backups. Maybe 10 years ago, you had to have a DBA to work with Oracle, but nowadays just about anyone doing software engineering can put up a Postgres database and tune their way to usability with about three wiki links; you don't need a DBA to design good operational habits.
Also, some of you might think of this as some type of doom and gloom piece against NoSQL; it is not meant that way at all. It's not that switching to NoSQL is a bad idea necessarily; there are some things that RDBMS software can't do as well as a more dedicated solution. But, if you think that switching to NoSQL will just let you hand-wave away all of the challenges of running a database, you are terribly misguided. If you're a DBA and you are worried about a future with NoSQL, take heart; study your product less and focus on these key architectural design points more. Those skills are critical now, and they will remain so in the future, NoSQL or not.