Ten minute guide

If you haven't already done the thirty second quickstart to get up and running, go do that first!

Client drivers

So far we've been using the Data Explorer to run queries. Of course, sooner or later you'll want to learn how to write ReQL queries in your favorite programming language. If you'd like to do that now, go forth and learn about the client drivers!

Updates

It turns out that historians unearthed a missing episode of Star Trek TNG (in which Data & Geordi are stuck in a holodeck dance-off and must boogie their way to freedom). Fortunately we can easily correct this error by running an update query:

r.table('tv_shows')
  .filter({ name: 'Star Trek TNG' })
  .update({ episodes: r.row('episodes').add(1) })

The statement r.row('episodes') above allows getting the current value of an attribute. You could, of course, use multiple values from the document, inner queries, etc. but we won't get into that quite yet.
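To make the role of r.row concrete, here is a minimal sketch in plain JavaScript that models what filter followed by update does. It is not the RethinkDB driver API: applyUpdate is a hypothetical helper, and the sample rows and episode counts are assumptions made for illustration.

```javascript
// Plain-JavaScript model of filter + update; NOT the RethinkDB driver API.
// The sample rows and their episode counts are assumed for illustration.
const tvShows = [
  { name: 'Star Trek TNG', episodes: 178 },
  { name: 'Battlestar Galactica', episodes: 75 },
];

// applyUpdate is a hypothetical helper: for every row matching `predicate`,
// merge in the fields returned by `changes(row)`. `changes` receives the
// current row, which is the part r.row plays in the ReQL query.
function applyUpdate(rows, predicate, changes) {
  return rows.map((row) =>
    predicate(row) ? { ...row, ...changes(row) } : row
  );
}

const updated = applyUpdate(
  tvShows,
  (row) => row.name === 'Star Trek TNG',      // like .filter({ name: ... })
  (row) => ({ episodes: row.episodes + 1 })   // like r.row('episodes').add(1)
);

console.log(updated);
```

The new value is computed from the document being updated, so the same call works whatever the current count happens to be.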

Actually, instead of incrementing the number of episodes, we could accomplish this goal more simply by setting the value to 179:

r.table('tv_shows')
  .filter({ name: 'Star Trek TNG' })
  .update({ episodes: 179 })

But incrementing is nice because it gives us the opportunity to introduce the notion of atomic updates. When you have a moment, go read about them in the architecture FAQ.

Table joins

Let's do something a little more interesting. Let's add another table with characters from TV shows, and include a few characters from each show:

r.db('test').tableCreate('characters');
r.table('characters').insert([{ name: 'Worf', show: 'Star Trek TNG' },
                              { name: 'Data', show: 'Star Trek TNG' },
                              { name: 'William Adama', show: 'Battlestar Galactica' },
                              { name: 'Homer Simpson', show: 'The Simpsons' }])

Suppose we want to join the two tables and for every show we have, list every character in the show. Piece of cake! Delicious, delicious cake.

r.table('tv_shows').innerJoin(r.table('characters'),
                              function(show, character) {
                                return show('name').eq(character('show'))
                              })

RethinkDB supports a number of different join types (inner, outer, and optimized equality joins). Read about them in the ReQL reference.
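For intuition about what the inner join above returns, here is a rough plain-JavaScript model. It is a naive nested loop, not how RethinkDB executes the query, and the contents of tv_shows are assumed (the episode counts in particular are placeholders). Each result pairs a matching show and character under the keys left and right.

```javascript
// Plain-JavaScript model of an inner join; NOT the RethinkDB driver API.
// The tv_shows rows are assumed; the characters match the insert above.
const tvShows = [
  { name: 'Star Trek TNG', episodes: 179 },
  { name: 'Battlestar Galactica', episodes: 75 },
];
const characters = [
  { name: 'Worf', show: 'Star Trek TNG' },
  { name: 'Data', show: 'Star Trek TNG' },
  { name: 'William Adama', show: 'Battlestar Galactica' },
  { name: 'Homer Simpson', show: 'The Simpsons' },
];

// Compare every left row with every right row and keep the pairs for which
// the predicate holds.
function innerJoin(left, right, predicate) {
  const pairs = [];
  for (const l of left) {
    for (const r of right) {
      if (predicate(l, r)) pairs.push({ left: l, right: r });
    }
  }
  return pairs;
}

const joined = innerJoin(
  tvShows,
  characters,
  (show, character) => show.name === character.show
);

console.log(joined.map((p) => `${p.left.name}: ${p.right.name}`));
```

With this sample data, Homer Simpson is left out of the result because no row in the assumed tv_shows table matches his show.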

Sharding

So far, we've been running everything on a single instance of RethinkDB. Suppose our Star Trek and Battlestar Galactica database is starting to pick up steam, because the unenlightened have finally recognized the significance of these timeless classics. Let's add another node to the cluster and shard the database across two nodes.

First, start a second rethinkdb process and join it with the first one:

$ rethinkdb -j localhost:29015 --port-offset 1 -d rethinkdb_data2 --machine-name Riker
info: Creating directory 'rethinkdb_data2'
info: Listening for intracluster connections on port 29016
info: Connected to server "Kunkka" f96ce5d0-f7db-4705-9269-877514d9f46d
info: Listening for client driver connections on port 28016
info: Listening for administrative HTTP connections on port 8081
info: Server ready

Note that we added a port offset so that each port is incremented by one, and a different directory is used to store data. This allows you to run a second instance of RethinkDB on the same physical machine and avoid port conflicts.
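The port arithmetic can be sketched as follows. The default ports (29015 for intracluster traffic, 28015 for client drivers, 8080 for the web UI) are inferred from the log output above; portsWithOffset is a hypothetical helper, not part of RethinkDB.

```javascript
// Default RethinkDB ports, inferred from the log output above
// (each logged port minus the offset of 1).
const DEFAULT_PORTS = { cluster: 29015, driver: 28015, http: 8080 };

// portsWithOffset is a hypothetical helper that adds the same offset to
// every default port, mirroring what --port-offset does.
function portsWithOffset(offset) {
  return Object.fromEntries(
    Object.entries(DEFAULT_PORTS).map(([name, port]) => [name, port + offset])
  );
}

const second = portsWithOffset(1);
console.log(second); // { cluster: 29016, driver: 28016, http: 8081 }
```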

Let's shard our database! RethinkDB does sharding per-table. A really easy way to shard is to view the table in the web UI and simply change the number of shards. That guided process uses statistical information about the data in the table to pick good split points for the shards.

Since we only have a few documents in the database, we want to be a little more precise, so we'll be using command line administration to shard. First, let's start up an administration tool and connect it to any machine in the cluster:

$ rethinkdb admin -j localhost:29016

Let's see information about the characters table:

localhost:29016> ls characters
...
shard      machine uuid                          name    primary  
-inf-+inf  878dc732-dab4-4dfd-af66-49fe695e2863  Kunkka  yes

After some information about the table, we see that it has one shard whose primary (master) is on the machine named Kunkka. Let's split the table into two shards using m as a split-point.

localhost:29016> split shard characters m

Let's see the table information again:

localhost:29016> ls characters
...
shard   machine uuid                          name    primary  
-inf-m  878dc732-dab4-4dfd-af66-49fe695e2863  Kunkka  yes      
m-+inf  760701e5-9398-45a9-bf36-fc57afcf5809  Riker   yes      

We have two shards, one on each machine! All documents with a primary key less than or equal to 'm' will go onto machine Kunkka, and all documents with a primary key greater than 'm' will go onto Riker.

Note that since we didn't specify a primary key attribute when we created the table, RethinkDB automatically generates a randomized key for each document and stores it in the attribute 'id'. The table will be sharded by this key. (The architecture FAQ has more information on how RethinkDB does sharding.)
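The split-point rule described above can be sketched in plain JavaScript. This only mirrors the rule as stated in this guide; it is not RethinkDB's routing code. The keys used here are hypothetical, and JavaScript's built-in string comparison is assumed to be a fair stand-in for the database's key ordering.

```javascript
// Model of routing a document to a shard by primary key, following the
// rule stated above: keys <= 'm' go to Kunkka, keys > 'm' go to Riker.
const SPLIT_POINT = 'm';

// machineForKey is a hypothetical helper; it relies on JavaScript's
// lexicographic string comparison.
function machineForKey(key) {
  return key <= SPLIT_POINT ? 'Kunkka' : 'Riker';
}

// Hypothetical primary keys, chosen only for illustration.
for (const key of ['adama', 'data', 'worf']) {
  console.log(key, '->', machineForKey(key));
}
```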

Parallelism, chaining, and aggregation

The wonderful thing about sharding in RethinkDB is that you can shard your tables without making any changes to the application. Queries will automatically be distributed across the cluster, executed in parallel, and the results will be combined and returned. For example, if you rerun the join query above, you'll get the same results despite the fact that data may now be on different machines distributed across the network.

Another nice thing about the parallelization engine is that queries can get arbitrarily complex and they'll still be parallelized and distributed by the system. We already saw an example of chaining queries when we updated the number of TNG episodes: we first used filter to specify which records we wanted to update, and then ran update on the filtered records (a query that was parallelized across shards as well).
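As a loose mental model of that distribution, the sketch below runs the same query function against each shard's documents and concatenates the partial results. It is a single-process, sequential illustration, not RethinkDB's execution engine; scatterGather, the document ids, and the placement of documents on shards are all assumptions.

```javascript
// Hypothetical placement of documents on two shards; the ids are made up.
const shards = {
  Kunkka: [
    { id: 'a1', name: 'Worf', show: 'Star Trek TNG' },
    { id: 'c7', name: 'William Adama', show: 'Battlestar Galactica' },
  ],
  Riker: [{ id: 'q3', name: 'Data', show: 'Star Trek TNG' }],
};

// scatterGather is a hypothetical helper: apply the same query to every
// shard's documents, then combine the partial results into one list.
function scatterGather(shardMap, queryFn) {
  return Object.values(shardMap).flatMap((docs) => queryFn(docs));
}

const tng = scatterGather(shards, (docs) =>
  docs.filter((doc) => doc.show === 'Star Trek TNG')
);

console.log(tng.map((doc) => doc.name));
```

The caller sees one combined result and never needs to know which shard held which document.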

Let's chain our join query to do grouping. Instead of listing characters, let's figure out how many characters each show has:

r.table('tv_shows').innerJoin(r.table('characters'),
                              function(show, character) {
                                return show('name').eq(character('show'))
                              })
                   .zip()
                   .groupBy('show', r.count)

This query will grab data from the right shards, do cross-shard joins, and do parallelized aggregation. Easy as pie!
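To see what the two chained steps do to the joined data, here is a plain-JavaScript sketch. It starts from join output written out by hand (an assumption based on the sample data), merges each pair the way zip does, with fields from the right side winning on name clashes, and then counts documents per show. The plain object used for the counts is a simplification and is not the result format RethinkDB returns.

```javascript
// Hand-written join output for the sample data: { left, right } pairs.
// The episode counts are placeholders.
const joined = [
  { left: { name: 'Star Trek TNG', episodes: 179 },
    right: { name: 'Worf', show: 'Star Trek TNG' } },
  { left: { name: 'Star Trek TNG', episodes: 179 },
    right: { name: 'Data', show: 'Star Trek TNG' } },
  { left: { name: 'Battlestar Galactica', episodes: 75 },
    right: { name: 'William Adama', show: 'Battlestar Galactica' } },
];

// Model of zip: merge right into left, so the character's `name`
// replaces the show's `name` in the merged document.
const zipped = joined.map(({ left, right }) => ({ ...left, ...right }));

// Model of grouping by `show` and counting the documents in each group.
const counts = {};
for (const doc of zipped) {
  counts[doc.show] = (counts[doc.show] || 0) + 1;
}

console.log(counts);
```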

Map/reduce

The groupBy command above isn't your grandma's groupBy command. It's built on top of a fully parallelized Hadoop-style map/reduce infrastructure. The query above is actually syntactic sugar for the following map/reduce command:

r.table('tv_shows').innerJoin(r.table('characters'),
                              function(show, character) {
                                return show('name').eq(character('show'))
                              })
                   .zip()
                   .groupedMapReduce(
                     function(doc){ return doc('show') },          // group mapping
                     function(doc){ return r.expr(1) },            // document mapping
                     0,                                            // base value
                     function(acc, val) { return acc.add(val) })   // reduction

Note that groupedMapReduce makes full use of the MVCC infrastructure, so you can run it on top of a live system completely lock-free (read more about MVCC in the architecture FAQ).
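The four arguments are easier to follow with a small model. The function below is a sequential, single-process sketch in plain JavaScript that takes the same four pieces (group mapping, document mapping, base value, reduction). It is not RethinkDB's implementation, which distributes the work across shards; the sample documents are assumed, and returning a Map is a choice made here for simplicity.

```javascript
// Sequential model of a grouped map/reduce; NOT RethinkDB's implementation.
function groupedMapReduce(docs, groupFn, mapFn, base, reduceFn) {
  const groups = new Map();
  for (const doc of docs) {
    const key = groupFn(doc);                            // group mapping
    const acc = groups.has(key) ? groups.get(key) : base; // start from base
    groups.set(key, reduceFn(acc, mapFn(doc)));          // map, then reduce
  }
  return groups;
}

// Assumed sample documents, shaped like the zipped join output.
const docs = [
  { name: 'Worf', show: 'Star Trek TNG' },
  { name: 'Data', show: 'Star Trek TNG' },
  { name: 'William Adama', show: 'Battlestar Galactica' },
];

const result = groupedMapReduce(
  docs,
  (doc) => doc.show,        // group mapping
  () => 1,                  // document mapping: every document counts as 1
  0,                        // base value
  (acc, val) => acc + val   // reduction
);

console.log(Object.fromEntries(result));
```

Swapping the document mapping and the reduction gives other aggregates, such as a sum of a numeric field, without changing the grouping logic.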

Next steps

Note: the next time you start RethinkDB, it will look for the rethinkdb_data directory in the current directory. If it finds the rethinkdb_data directory left over by this tutorial, it will expect a second node to join the cluster.

Phew. Of course you wouldn't use the Data Explorer to write actual applications. Take a few minutes and learn how to use the client drivers from your favorite programming language.