In 2009 we were growing fast. Very fast. So fast, in fact, that we were beginning to crush our MySQL database while running queries against a few tables that had grown to contain between 10 and 30 million rows. We moved the data from some of our largest tables into CouchDB to help reduce the load on our MySQL server (see this series of blog posts for more info).
The move to CouchDB went well. Pages in our web application that would occasionally time out were now loading in a couple of seconds. And, our MySQL database was much, much happier. We liked CouchDB so much that we started planning a feature that would make heavy use of CouchDB’s schema-less nature.
And that’s when the wheels came off.
The feature in question is what we now call our Subscriber Profile. The Subscriber Profile feature allows you to collect any information you’d like from your subscribers, and use that data later to send them custom, targeted messages. We knew our customers would want to collect a wide variety of information. Everyone’s needs are different, so we wanted to build a framework that could handle the array of unique information our customers wanted to store. A document database was a great fit for this problem. The schema-less nature of a document database lets you store anything, allowing one subscriber’s document to be completely different from another’s.
Since CouchDB does not support ad-hoc queries, we used the popular couchdb-lucene add-on to allow us to search the documents. This powered our segmentation capabilities, enabling us to find the subscribers that matched criteria defined by our user.
This approach worked just fine for customers with smaller databases, but struggled to keep up for customers with millions of subscribers. Looking back, there are several reasons why CouchDB and couchdb-lucene were not a good fit for this project.
Why CouchDB and couchdb-lucene Were Not the Right Fit
HTTP is a Very Slow Database Protocol
You can do some amazing things when using HTTP as your database protocol. Anybody who can talk HTTP can be a client (which is pretty much everybody). Load balancing between multiple CouchDB instances is a breeze. And, you open the door to do some pretty innovative things, like allowing your web application to live side by side with your data, and serving it directly from the database to the client without the need for a separate web server (CouchApps).
But, HTTP is incredibly slow compared to a binary database protocol. Consider the simple task of fetching a piece of information from the database. According to our simple tests, fetching a document by ID in CouchDB takes 20ms. Compare that to 5ms when fetching a document from MongoDB (by an indexed field), and 0.4ms when fetching a row from MySQL by ID.
When using CouchDB views, the speed at which CouchDB fetches and returns the pre-computed view results more than makes up for the latency introduced by HTTP. However, when performing many operations on a document by document basis, this latency starts to add up.
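To make the per-document latency concrete, here is a back-of-the-envelope sketch using the rough numbers from our tests (they are illustrative, not universal constants):

```python
# Illustration of how per-request latency adds up when touching documents
# one at a time. The latencies are the rough numbers from our simple tests.

COUCHDB_FETCH_MS = 20.0   # HTTP round trip + lookup by ID
MONGODB_FETCH_MS = 5.0    # binary protocol, indexed field
MYSQL_FETCH_MS = 0.4      # binary protocol, row by primary key

def total_seconds(per_fetch_ms: float, n_docs: int) -> float:
    """Total wall-clock time to fetch n_docs documents one by one."""
    return per_fetch_ms * n_docs / 1000.0

# Touching 1 million subscriber documents individually:
n = 1_000_000
print(f"CouchDB: {total_seconds(COUCHDB_FETCH_MS, n):,.0f} s")  # ~20,000 s (over 5 hours)
print(f"MongoDB: {total_seconds(MONGODB_FETCH_MS, n):,.0f} s")  # ~5,000 s
print(f"MySQL:   {total_seconds(MYSQL_FETCH_MS, n):,.0f} s")    # ~400 s
```

Real workloads batch and parallelize, of course, but the order-of-magnitude gap per operation is the point.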
MVCC Comes at a Cost

MVCC (multiversion concurrency control) is an important part of CouchDB’s design, and a key part of how replication works. Every write to a document in CouchDB creates a new version of that document in the database. The old versions of the document stick around, eating up disk space, until they are cleaned up by running a compaction operation on that database.
Database compactions themselves can be resource-intensive and time-consuming operations. If you’re not taking advantage of the benefits provided by MVCC, then the work required to compact your database, and the extra disk space used by the older versions of the documents, are simply overhead. This is especially true in a write-heavy database.
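A toy sketch of the idea (this is emphatically not CouchDB’s actual storage engine, just the shape of the problem): every update appends a new revision, and old revisions linger until a compaction pass drops them.

```python
# Toy illustration of why MVCC costs disk space in a write-heavy store.
# NOT CouchDB's real storage engine -- just the accounting of the idea.

class ToyMvccStore:
    def __init__(self):
        self._revisions = {}  # doc_id -> list of (rev_number, body)

    def put(self, doc_id, body):
        revs = self._revisions.setdefault(doc_id, [])
        revs.append((len(revs) + 1, body))  # old revisions stay on "disk"

    def get(self, doc_id):
        return self._revisions[doc_id][-1][1]  # readers always see the latest rev

    def stored_revisions(self):
        return sum(len(revs) for revs in self._revisions.values())

    def compact(self):
        # Drop everything but the newest revision of each document.
        for doc_id, revs in self._revisions.items():
            self._revisions[doc_id] = revs[-1:]

store = ToyMvccStore()
for i in range(100):
    store.put("subscriber:42", {"visits": i})  # 100 updates to one document

print(store.stored_revisions())  # 100 revisions on disk for 1 live document
store.compact()
print(store.stored_revisions())  # 1
```

If you never rewind to old revisions or rely on them for replication conflict handling, all of that extra storage and compaction work buys you nothing.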
Large Databases Beat Up the Hard Disk
Our CouchDB instances (one primary, and one backup) were running on two 8-core 2.2 GHz machines, each with 32GB of RAM and six 15,000 RPM hard drives in a RAID 10 configuration. Despite these database servers being fairly capable machines, it was uncommon to see disk utilization (as reported by iostat) drop below 80%. In other words, CouchDB kept the disks on these database servers very busy.
Although we are not entirely sure what was causing this, we believe it had something to do with the size of our databases and views, which were constantly being swapped in and out of the disk cache (which was set to use almost all of the system memory).
CouchDB is not a Distributed Database (by default)
Apache CouchDB is not a distributed database. It is a stand-alone database with powerful replication capabilities. While it is certainly possible to build a distributed solution on top of those replication capabilities (see BigCouch), nothing like this exists in the base CouchDB project right now.
We were at the point where we really needed a distributed database. The size of our databases and the number of requests they were handling were simply becoming too much for a single machine to handle. While moving to something like BigCouch was certainly possible, it would not have solved the other issues outlined above.
Lucene (used by couchdb-lucene) is not a Distributed Search Solution
Couchdb-lucene is a project that pipes changes made in CouchDB to the Apache Lucene full text search engine. This project is great because it essentially adds the ability to perform ad-hoc queries back to CouchDB. However, our search index had grown to the size where we needed a distributed search solution.
Again, while switching to ElasticSearch (which integrates nicely with CouchDB) would have addressed this pain point, it would not have addressed the other issues we were facing.
Other General Complaints
We had some other general complaints with CouchDB as well.
First and foremost, we found CouchDB difficult to develop for. The map/reduce paradigm takes a while to get used to. But even after you understand how to build views to query your data, issues remain, like the amount of time it takes to build a view against a large data set. We would sometimes wait days for a view to finish building. And if an error in the view was flushed out by some edge case in the production data, you would have to scrap the view and start over after fixing the issue in the map/reduce code. This resulted in a slower development process.
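For readers who haven’t worked with CouchDB views, a design document is just JSON you PUT to `/dbname/_design/...`, with the map/reduce functions embedded as JavaScript strings. Here is a hedged sketch of what one of ours might have looked like (the field names are hypothetical); any bug in that map string means rebuilding the entire index from scratch:

```python
import json

# Sketch of a CouchDB design document (field names hypothetical).
# CouchDB builds a B-tree over the emitted keys the first time the
# view is queried -- which is the days-long build described above.
design_doc = {
    "_id": "_design/subscribers",
    "language": "javascript",
    "views": {
        "by_signup_month": {
            # map runs once per document and emits (key, value) pairs
            "map": """
                function (doc) {
                    if (doc.type === 'subscriber' && doc.signed_up_at) {
                        emit(doc.signed_up_at.substr(0, 7), 1);
                    }
                }
            """,
            # built-in reduce that counts subscribers per emitted key
            "reduce": "_count",
        }
    },
}

# The document is PUT to the database as plain JSON:
payload = json.dumps(design_doc)
```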
Second, databases and views are gigantic on disk. I know, I know…disk space is cheap, right? Well, not for everybody. Because of the size of the views, we had to upgrade to a pair of very expensive database servers that increased our monthly hosting bill by a substantial amount. Also, since the database and views are stored as single files on disk, it dramatically increased the amount of time it took to run our nightly backups. Even if only a single document in your 500GB database changed, you’d have to back up the entire 500GB file.
Third, we even ran into a few problems with replication, which is CouchDB’s main strength. When migrating CouchDB to a pair of beefier database servers, we ran into a problem where replication would forget where it was in the process. This resulted in CouchDB having to churn through the entire 240GB database to determine the sequence number it should resume replication from, a process which took 4 days. This happened on more than one occasion, and was a huge time suck.
The New Solution
We have been playing around with MongoDB for quite some time. We first used it to back our mobile store locator feature, since MongoDB has fantastic support for geo-spatial queries. From there, we used it to dramatically speed up the performance of our subscription list reporting capabilities. It handled these two tasks very well, and after doing some research, we determined that it might also be a better fit for our Subscriber Profile feature.
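As a taste of what makes the geo-spatial support so convenient, here is the kind of query document a store locator can send (field names and coordinates are hypothetical examples, not our schema). With a legacy `2d` index on the location field, finding the nearest stores is a single query:

```python
# With a "2d" index on the "loc" field, MongoDB answers nearest-neighbor
# queries directly. Field names and coordinates are hypothetical.
store_locator_query = {
    "loc": {
        "$near": [-87.6298, 41.8781],  # [longitude, latitude] -- Chicago
        "$maxDistance": 10 / 69.0,     # legacy 2d coords are degrees; roughly 10 miles
    }
}
```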
Why use MongoDB instead of CouchDB for this feature?
- When it came to fetching and updating single documents, MongoDB was considerably faster than CouchDB
- MongoDB doesn’t use MVCC, so there was less overhead to deal with
- MongoDB doesn’t beat up the hard disk
- While using replica sets, MongoDB is more of a distributed database, distributing read requests, and handling fail overs
This left us with one question: how would we go about implementing our segmentation capabilities, or the ability to find a group of subscribers based on criteria specified by the user? Unlike CouchDB, MongoDB supports ad-hoc queries. But, like a relational database, the proper indices need to be defined in order to make those queries fast.
In the hopes of removing the need for a separate full text search engine, we spent some time trying to determine if we could use MongoDB indices to implement the segmentation logic. Despite our best efforts, we were unable to make this work. Since our users can create any number of properties, and search for subscribers based on any of them, we would have had to index every property in every document. This would have required way too much memory.
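To see why the index count explodes, consider a few hypothetical segmentation filters (the property names below are made-up examples, not real customer data). Each distinct property used in a filter needs its own MongoDB index to stay fast:

```python
# Why per-property indexing didn't scale: every customer defines their own
# properties, so segmentation filters can take any shape. Each distinct
# property path would need an index. (Property names are hypothetical.)

segment_a = {"properties.favorite_team": "Cubs"}       # wants an index on properties.favorite_team
segment_b = {"properties.age": {"$gte": 21},
             "properties.zip_code": "60601"}           # wants a compound index across two paths
segment_c = {"properties.newsletter_opt_in": True}     # ...and so on, per customer, per property

# With thousands of customers each defining their own fields, the number of
# indexes -- and the RAM they need to stay resident -- grows without bound.
```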
So, we needed a full text search engine after all. Already knowing that we needed a distributed solution, we took a look at Apache Solr (Lucene’s distributed cousin) and ElasticSearch (also backed by Lucene). Using a Lucene-backed search solution meant that we would be able to keep a good chunk of the code we had already written to segment subscribers based on their profile data. We chose ElasticSearch based on its impressive feature set, strong community, and the fact that the project has a whole lot of momentum right now.
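The same hypothetical segmentation criteria translate naturally into an ElasticSearch query body, which the cluster then fans out across shards for us. A sketch (again, field names are made-up examples):

```python
import json

# A segmentation filter expressed as an ElasticSearch bool query
# (field names hypothetical). ElasticSearch distributes the search
# across the cluster's shards automatically.
segment_query = {
    "query": {
        "bool": {
            "must": [
                {"term": {"properties.favorite_team": "cubs"}},
                {"range": {"properties.age": {"gte": 21}}},
            ]
        }
    },
    "size": 50,
}

# POSTed to a search endpoint such as /subscribers/_search:
body = json.dumps(segment_query)
```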
What All of this Means For Our Customers
Moving to a distributed solution for holding the subscriber profile data offers the immediate benefit of quicker response times when interacting with customer profile data. Also, this solution offers a more reliable service moving forward. You can rest assured that we have the capability to grow our infrastructure as you grow your subscriber base.
In addition, we have already been able to take advantage of our new distributed full text search engine. Shortly after standing up our ElasticSearch cluster, we moved our SMS message search functionality from Lucene to ElasticSearch. This added the ability to search for SMS messages up to a year old (previously you could only go back 90 days), as well as provided the ability to search for messages based on their “type”.
More recently, we were able to utilize ElasticSearch’s Facets API to power the data you see on the People Dashboard.
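For the curious, a facets request piggybacks on a normal search: you ask for the top values of a field alongside (or instead of) the matching documents. A hedged sketch of the shape of such a request (index and field names are hypothetical, not our actual dashboard queries):

```python
# Sketch of a terms facet request of the kind that can power a dashboard
# (field names hypothetical). The facet returns the top field values with
# counts -- e.g. subscribers per state -- in a single search round trip.
people_dashboard_request = {
    "query": {"match_all": {}},
    "size": 0,  # we only want the facet counts, not the documents themselves
    "facets": {
        "subscribers_by_state": {
            "terms": {"field": "properties.state", "size": 10}
        }
    },
}
```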
The future of CouchDB at Signal
As I mentioned at the beginning of this post, the Subscriber Profile was just one feature using CouchDB, and by far the one that caused us the most trouble. CouchDB and Lucene were simply not the right tools for this job.
The fact that we changed this feature to use MongoDB and ElasticSearch does not mean that we have stopped using CouchDB, or would never use it for anything again. CouchDB views are an incredibly powerful feature. They’re not as flexible as some other options, but they really do allow you to query a large amount of data very quickly. And CouchDB’s replication capabilities, and its ability to run on Android and iOS devices, make it a very compelling choice for mobile applications, especially those that need to operate without an internet connection.
We are still using CouchDB in other areas of our application. While we have plans to move some of these features off of CouchDB and onto MongoDB, we currently have no plans to move the others.
I would never throw out a hammer because it didn’t help me drive in a screw.