Ten minute guide
If you haven't already done the thirty second quickstart to get up and running, go do that first!
Client drivers
So far we've been using the Data Explorer to run queries. Of course sooner or later you'll want to learn how to write ReQL queries in your favorite programming language. If you'd like to do that now, go forth and learn about the client drivers!
Updates
It turns out that historians unearthed a missing episode of Star Trek TNG (in which Data & Geordi are stuck in a holodeck dance-off and must boogie their way to freedom). Fortunately we can easily correct this error by running an update query:
r.table('tv_shows')
 .filter({ name: 'Star Trek TNG' })
 .update({ episodes: r.row('episodes').add(1) })
The statement r.row('episodes') above allows getting the current value of
an attribute. You could, of course, use multiple values from the
document, inner queries, etc. but we won't get into that quite yet.
Actually, instead of incrementing the number of episodes, we could accomplish this goal more simply by setting the value to 179:
r.table('tv_shows')
 .filter({ name: 'Star Trek TNG' })
 .update({ episodes: 179 })
But incrementing is nice because it gives us the opportunity to introduce the notion of atomic updates. When you have a moment, go read about them in the architecture FAQ.
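To make the difference between the two update styles concrete, here is a minimal in-memory sketch in plain JavaScript. It does not use the RethinkDB driver: updateWhere and the sample documents are hypothetical stand-ins for the filter and update steps above, and the starting count of 178 is only inferred from the target of 179.

```javascript
// In-memory stand-in for the table; these documents are illustrative only.
const tvShowDocs = [
  { name: 'Star Trek TNG', episodes: 178 },
  { name: 'Battlestar Galactica', episodes: 75 }
];

// Hypothetical helper: apply `changes` to every document matching `pattern`.
// A value in `changes` may be a function of the current document, which
// plays the role of r.row(...) in the ReQL query.
function updateWhere(table, pattern, changes) {
  let replaced = 0;
  for (const doc of table) {
    const matches = Object.keys(pattern).every(k => doc[k] === pattern[k]);
    if (!matches) continue;
    for (const [field, value] of Object.entries(changes)) {
      doc[field] = typeof value === 'function' ? value(doc) : value;
    }
    replaced += 1;
  }
  return replaced;
}

// Increment relative to the current value...
updateWhere(tvShowDocs, { name: 'Star Trek TNG' },
            { episodes: doc => doc.episodes + 1 });
// ...or overwrite with a constant.
updateWhere(tvShowDocs, { name: 'Star Trek TNG' }, { episodes: 179 });
```

In this sketch, repeating the constant update is harmless, while repeating the increment would count the missing episode twice.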
Table joins
Let's do something a little more interesting. Let's add another table with characters from TV shows, and include a few characters from each show:
r.db('test').tableCreate('characters');
r.table('characters').insert([
    { name: 'Worf', show: 'Star Trek TNG' },
    { name: 'Data', show: 'Star Trek TNG' },
    { name: 'William Adama', show: 'Battlestar Galactica' },
    { name: 'Homer Simpson', show: 'The Simpsons' }
])
Suppose we want to join the two tables and for every show we have, list every character in the show. Piece of cake! Delicious, delicious cake.
r.table('tv_shows').innerJoin(r.table('characters'),
    function(show, character) {
        return show('name').eq(character('show'))
    })
RethinkDB supports a number of different join types (inner, outer, and optimized equality joins). Read about them in the ReQL reference.
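If the semantics of the inner join are unclear, the following plain JavaScript sketch may help. It is a naive nested-loop join over in-memory arrays, not how RethinkDB executes the query. The helper innerJoinSketch is hypothetical, and the contents of the shows array are assumed for illustration.

```javascript
// Illustrative sample data; the real tables may hold different documents.
const showDocs = [
  { name: 'Star Trek TNG' },
  { name: 'Battlestar Galactica' }
];
const characterDocs = [
  { name: 'Worf', show: 'Star Trek TNG' },
  { name: 'Data', show: 'Star Trek TNG' },
  { name: 'William Adama', show: 'Battlestar Galactica' },
  { name: 'Homer Simpson', show: 'The Simpsons' }
];

// Hypothetical helper: every pair that satisfies the predicate is kept
// as a { left, right } document; unmatched documents are dropped.
function innerJoinSketch(left, right, predicate) {
  const pairs = [];
  for (const l of left) {
    for (const r of right) {
      if (predicate(l, r)) pairs.push({ left: l, right: r });
    }
  }
  return pairs;
}

const joinedPairs = innerJoinSketch(showDocs, characterDocs,
  (show, character) => show.name === character.show);
```

With this sample data, Homer Simpson has no matching show and so does not appear in the result.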
Sharding
So far, we've been running everything on a single instance of RethinkDB. Suppose our Star Trek and Battlestar Galactica database is starting to pick up steam, because the unenlightened have finally recognized the significance of these timeless classics. Let's add another node to the cluster and shard the database across two nodes.
First, start a second rethinkdb process and join it with the first one:
$ rethinkdb -j localhost:29015 --port-offset 1 -d rethinkdb_data2 --machine-name Riker
info: Creating directory 'rethinkdb_data2'
info: Listening for intracluster connections on port 29016
info: Connected to server "Kunkka" f96ce5d0-f7db-4705-9269-877514d9f46d
info: Listening for client driver connections on port 28016
info: Listening for administrative HTTP connections on port 8081
info: Server ready
Note that we added a port offset so that each port is incremented by one, and a different directory is used to store data. This allows you to run a second instance of RethinkDB on the same physical machine and avoid port conflicts.
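The effect of the offset can be sketched with a little arithmetic. The base ports below are inferred from the log output above (each logged port minus the offset of 1), and portsWithOffset is a hypothetical helper, not part of RethinkDB.

```javascript
// Base ports inferred from the log lines above, assuming an offset of 0.
const BASE_PORTS = { cluster: 29015, driver: 28015, http: 8080 };

// Hypothetical helper: shift every port by the same offset.
function portsWithOffset(offset) {
  const shifted = {};
  for (const [name, port] of Object.entries(BASE_PORTS)) {
    shifted[name] = port + offset;
  }
  return shifted;
}

const secondInstancePorts = portsWithOffset(1);
console.log(secondInstancePorts);
```

With an offset of 1, the three values match the ports reported by the second process.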
Let's shard our database! RethinkDB does sharding per-table. A really easy way to shard is to view the table in the web UI and simply change the number of shards. That guided process uses statistical information about the data in the table to pick good split points for the shards.
Since we only have a few documents in the database, we want to be a little more precise, so we'll be using command line administration to shard. First, let's start up an administration tool and connect it to any machine in the cluster:
$ rethinkdb admin -j localhost:29016
Let's see information about the characters table:
localhost:29016> ls characters
...
shard      machine uuid                          name    primary
-inf-+inf  878dc732-dab4-4dfd-af66-49fe695e2863  Kunkka  yes
After some information about the table, we see that it has one shard whose primary is on the machine named Kunkka. Let's split the table into two shards using m as a split point.
localhost:29016> split shard characters m
Let's see the table information again:
localhost:29016> ls characters
...
shard      machine uuid                          name    primary
-inf-m     878dc732-dab4-4dfd-af66-49fe695e2863  Kunkka  yes
m-+inf     760701e5-9398-45a9-bf36-fc57afcf5809  Riker   yes
We have two shards, one on each machine! All documents with a primary key less than or equal to 'm' will go onto machine Kunkka, and all documents with a primary key greater than 'm' will go onto Riker.
Note that since we didn't specify a primary key attribute when we created the table, RethinkDB automatically generates a randomized key for each document and stores it in the attribute 'id'. The table will be sharded by this key. (The architecture FAQ has more information on how RethinkDB does sharding.)
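As a rough illustration of range sharding by split point, here is a small plain JavaScript sketch. It follows the boundary rule as described above (keys less than or equal to the split point stay in the lower shard) and uses ordinary JavaScript string ordering. The helper shardFor and the sample keys are hypothetical.

```javascript
// Hypothetical helper: pick a shard index for a primary key, given a
// sorted list of split points. A key equal to a split point stays in
// the lower shard, following the rule described in the text.
function shardFor(key, splitPoints) {
  let index = 0;
  while (index < splitPoints.length && key > splitPoints[index]) {
    index += 1;
  }
  return index;
}

// Shard 0 and shard 1, named after the machines in the listing above.
const shardMachines = ['Kunkka', 'Riker'];

// Made-up keys, chosen only to land on both sides of the split point.
for (const key of ['apple', 'm', 'zebra']) {
  console.log(key, '->', shardMachines[shardFor(key, ['m'])]);
}
```

In this sketch, a key that starts with a hexadecimal digit, such as a UUID-style id, sorts below 'm' and therefore lands on the first shard.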
Parallelism, chaining, and aggregation
The wonderful thing about sharding in RethinkDB is that you can shard your tables without making any changes to the application. Queries will automatically be distributed across the cluster, executed in parallel, and the results will be combined and returned. For example, if you rerun the join query above, you'll get the same results despite the fact that data may now be on different machines distributed across the network.
Another nice thing about the parallelization engine is that queries can get
arbitrarily complex and they'll still be parallelized and distributed by the
system. We already saw an example of chaining queries when we updated the
number of TNG episodes: we first used filter to specify which records
we wanted to update, and then ran update on the filtered records (a query
that was parallelized across shards as well).
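The general idea of running one query against every shard and combining the partial results can be sketched as follows. This is a sequential, in-memory illustration only. It says nothing about how RethinkDB actually schedules work, and scatterGather, the shard contents, and the ids are all hypothetical.

```javascript
// Two hypothetical shards, each holding part of the same table.
const characterShards = [
  [{ id: 'a1', show: 'Star Trek TNG' }, { id: 'c7', show: 'Star Trek TNG' }],
  [{ id: 'p4', show: 'Battlestar Galactica' }, { id: 't9', show: 'Star Trek TNG' }]
];

// Hypothetical helper: run the same query on every shard, then combine.
function scatterGather(shardList, queryShard) {
  const partials = shardList.map(queryShard); // one partial result per shard
  return partials.flat();                     // merge into a single result
}

// The caller writes one filter and never mentions the shards.
const tngCharacters = scatterGather(characterShards,
  docs => docs.filter(d => d.show === 'Star Trek TNG'));
```

The calling code looks the same whether the table has one shard or several, which is the property described above.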
Let's chain our join query to do grouping. Instead of listing
characters, let's figure out how many characters each show has:
r.table('tv_shows').innerJoin(r.table('characters'),
    function(show, character) {
        return show('name').eq(character('show'))
    })
 .zip()
 .groupBy('show', r.count)
This query will grab data from the right shards, do cross-shard joins, and do parallelized aggregation. Easy as pie!
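To see what the zip and grouping steps do to the joined documents, here is a small in-memory sketch in plain JavaScript. It assumes that zip merges each right document into its left document, with right-hand fields winning on a clash such as name. The sample pairs are illustrative.

```javascript
// Illustrative join output: one { left, right } pair per match.
const samplePairs = [
  { left: { name: 'Star Trek TNG' },
    right: { name: 'Worf', show: 'Star Trek TNG' } },
  { left: { name: 'Star Trek TNG' },
    right: { name: 'Data', show: 'Star Trek TNG' } },
  { left: { name: 'Battlestar Galactica' },
    right: { name: 'William Adama', show: 'Battlestar Galactica' } }
];

// zip step: flatten each pair into one document (right wins on clashes).
const zippedDocs = samplePairs.map(p => ({ ...p.left, ...p.right }));

// group-and-count step: count documents per value of `show`.
const countsByShow = {};
for (const doc of zippedDocs) {
  countsByShow[doc.show] = (countsByShow[doc.show] || 0) + 1;
}
console.log(countsByShow);
```

Because both tables have a name field, the zipped document keeps the character's name, and the show title survives in the show field, which is why the grouping uses show.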
Map/reduce
The groupBy command above isn't your grandma's
groupBy command. It's built on top of a fully
parallelized Hadoop-style map/reduce infrastructure. The query above
is actually syntactic sugar for the following map/reduce command:
r.table('tv_shows').innerJoin(r.table('characters'),
    function(show, character) {
        return show('name').eq(character('show'))
    })
 .zip()
 .groupedMapReduce(
    function(doc) { return doc('show') },        // group mapping
    function(doc) { return r.expr(1) },          // document mapping
    0,                                           // base value
    function(acc, val) { return acc.add(val) })  // reduction
Note that groupedMapReduce makes full use of the MVCC infrastructure,
so you can run it on top of a live system completely lock-free (read
more about MVCC in the architecture FAQ).
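The four arguments are easier to follow with a toy version in hand. The sketch below is a sequential, single-process imitation in plain JavaScript. It is not the RethinkDB implementation and has no parallelism or MVCC. The function groupedMapReduceSketch and the sample documents are hypothetical.

```javascript
// Hypothetical helper mirroring the four arguments used above:
// group mapping, document mapping, base value, and reduction.
function groupedMapReduceSketch(docs, groupOf, valueOf, base, reduce) {
  const groups = new Map();
  for (const doc of docs) {
    const key = groupOf(doc);
    // Start each group from the base value, then fold in mapped values.
    const acc = groups.has(key) ? groups.get(key) : base;
    groups.set(key, reduce(acc, valueOf(doc)));
  }
  return groups;
}

// Illustrative zipped documents, as the join plus zip might produce.
const zippedCharacters = [
  { name: 'Worf', show: 'Star Trek TNG' },
  { name: 'Data', show: 'Star Trek TNG' },
  { name: 'William Adama', show: 'Battlestar Galactica' }
];

const characterCounts = groupedMapReduceSketch(
  zippedCharacters,
  doc => doc.show,          // group mapping
  doc => 1,                 // document mapping
  0,                        // base value
  (acc, val) => acc + val); // reduction
```

Mapping every document to 1 and summing is what turns the general map/reduce into a per-group count.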
Next steps
Note: the next time you start RethinkDB, it will look for the
rethinkdb_data directory in the current directory. If it finds the
rethinkdb_data directory left over by this tutorial, it will expect
a second node to join the cluster.
Phew. Of course you wouldn't use the Data Explorer to write actual applications. Take a few minutes and learn how to use the client drivers from your favorite programming language.