Scaling Riak to 25 million ops/day at Kiip

Riak is an open source, highly scalable, fault-tolerant distributed database.

Kiip is a mobile reward (mobile ad) network: it connects brands and companies (advertisers) with consumers through virtual achievements in their apps. As their application’s use increased, they needed to adjust their infrastructure to meet the increasing demands (25M ops/day). In the following talk, Armon Dadgar and Mitchell Hashimoto discuss how and why they are using Riak in production, the problems they we facing and their solution.

The talk was recorded at the May San Francisco Riak Meetup, in May, 2012.

MongoDB in production and beyond

Kiip started their infrastructure with MySQL, but switched to MongoDB before they had any real traffic. As they were using it, and as their traffic and data grew, they experienced the following problems:

  • A global MongoDB write lock limited their incoming write operations. Their solution was to cache and aggregate the writes over 10 seconds, and with batched updates, they hit the lock less frequently.
  • They were still limited to 1000x read/sec, probably because some IO problem, but heavy caching solved that.
  • Slow analytics queries touch a lot of data, and they were IO-bound operation. Using Bloom-filters solved this issue for a while.
  • As they hit the global write lock again, they separated two clusters for different datasets. Analytics with heavy writes in the first one, everything else in the other cluster.
  • As the data on a given machine grew, the index size hit the memory limits. They have introduced an ETL (Extract / Transform / Load) data-processing workflow, to separate older data from the database, archiving older ones to S3.
  • ETL lagged behind more than its daily schedule, they needed further DB separation.

Eventually, after they have found that all their API response times were related to MongoDB, they started to research other DBs. Their finding and opinions were the following:

  • They ditched RDBMS, as they didn’t want to deal with custom, homemade sharding layers and the added complexity that comes with those.
  • Their co-founder is from Digg, they had bad experience with Cassandra.
  • HBase seemed to be an operational nightmare.
  • CouchDB did horizontal scaling on application level, it was a no-go.
  • Riak had solid academic foundation, with happiness both from developers and operations in the real world.

Data migration to Riak

Fast growing data was the natural candidate to move first to Riak: they have started with session and device data, as they grew at exponential rate.

Sessions have the advantage to be key-value by nature, a good fit for Riak. They updated their data access layer, besides that, not application-level change was required. The python client had some problems in the protobuf implementation, and some errors that are already fixed now (e.g. keepalive header).

The switch was simple and without downtime:

  • write new data to Riak
  • read from Riak first, fall back to MongoDB if missing

After a week, they have removed MongoDB completely.

The devices data was a bit different, eventually they have settled with a key-use that was generated from a hash-function. Their other attempt (e.g. using secondary indexes, map/reduce or id indirection) were too slow with unusable latency (0.2-2 seconds).

Riak in production (over 3 months)

Kiip team found Riak extremely solid, with only a few pain points, and they have provided advices and tips that might help others:

  • Scale early. Once your system is at the saturation point, it is really hard to get done anything with the load. Latency explodes at heavy IO, and the new node will just increase that in the short term (while it gets the data re-balanced). Monitor latency.
  • Don’t use secondary index (2i) in real-time queries, just in background (5ms get, vs 2000 ms 2i). LevelDB adds a new disk seek IO at each new level, adds more time to the latency.
  • The JavaScript engine requires a lot of RAM, runs out very quickly, cap the use.
  • Don’t restart nodes in rapid succession. Bad automation might result in strange conditions, ruins performance and sometimes causes inconsistencies.

But nothing is a magic bullet, Riak is no exception. They still use MongoDB for geospatial data, however, they will migrate those to PostGIS. Kiip uses PostgreSQL for non-key-value, not fast-growing data.

Scaling is hard, but for the horizontally scalable K/V data, Riak seems to be a right choice.

Date: 2012-09-25

Share on...

Let us know what you think!

Send your feedback by e-mail!