Troubleshooting common RethinkDB problems

How can I get a dump of the RethinkDB system tables?

This can be useful for diagnostic purposes, as well as for filing bug reports. The easiest way to do this is with ReQL administration commands. Any individual table can be examined with r.db('rethinkdb').table(<tablename>).

The following command will output the contents of all the configuration and status tables, as well as the 50 most recent entries in the logs table:

r.expr(["current_issues", "jobs", "stats", "server_config", "server_status",
"table_config", "table_status", "db_config", "cluster_config"]).map(
    [r.row, r.db('rethinkdb').table(r.row).coerceTo('array')]
).coerceTo('object').merge(
    {logs: r.db('rethinkdb').table('logs').limit(50).coerceTo('array')}
)

(That command is suitable for running in the Data Explorer, but can easily be adapted to other languages.)

I get a “ReqlResourceLimitError: Array over size limit 100000” when trying to order a table

Ordering without an index requires the server to load the whole sequence into an array, which is limited by default to 100,000 documents. You can pass the arrayLimit option to run to raise this limit temporarily. However, a more efficient option is to use an index. See the documentation for orderBy for more information.
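For example, in Python (the posts table and its date index are hypothetical names):

# Temporary workaround: raise the array limit for this one query.
r.table('posts').order_by('date').run(conn, array_limit=200000)

# Better: order by an index, which streams results in order instead of
# building the whole sequence in memory.
r.table('posts').index_create('date').run(conn)
r.table('posts').index_wait('date').run(conn)
r.table('posts').order_by(index='date').run(conn)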

My insert queries are slow. How can I speed them up?

RethinkDB uses a safe default configuration for write acknowledgement. Each write is committed to disk before the server acknowledges it to the client. If you’re running a single thread that inserts documents into RethinkDB in a loop, each insert must wait for the server acknowledgement before proceeding to the next one. This can significantly slow down the overall throughput.

This behavior is similar to that of any other safe database system. Below are a number of steps you can take to speed up insert performance in RethinkDB. Most of these guidelines also apply to other database systems.

Increase concurrency. Instead of having a single thread inserting data in a loop, create multiple threads with multiple connections. This allows insert queries to be parallelized, so that less of the total time is spent waiting on disk acknowledgement.
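A minimal sketch of this approach in Python (the documents list is hypothetical; each thread gets its own connection, since a connection shouldn't be shared across threads):

import threading
import rethinkdb as r

def insert_worker(docs):
    # One connection per thread, so inserts proceed in parallel.
    conn = r.connect('localhost', 28015)
    for doc in docs:
        r.db('foo').table('bar').insert(doc).run(conn)
    conn.close()

# Split the hypothetical documents list across four worker threads.
chunks = [documents[i::4] for i in range(4)]
threads = [threading.Thread(target=insert_worker, args=(c,)) for c in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()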

Batch writes. Instead of doing single writes in a loop, group writes together. This can result in a significant increase in throughput. Rather than running multiple queries like this:

r.db("foo").table("bar").insert(document_1).run()
r.db("foo").table("bar").insert(document_2).run()
r.db("foo").table("bar").insert(document_3).run()

Combine them into a single query:

r.db("foo").table("bar").insert([document_1, document_2, document_3]).run()

RethinkDB operates at peak performance when the batch size is around two hundred documents.
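For example, a large list can be split into batches of roughly that size before inserting (a Python sketch; documents is a hypothetical list):

BATCH_SIZE = 200
for i in range(0, len(documents), BATCH_SIZE):
    r.db('foo').table('bar').insert(documents[i:i + BATCH_SIZE]).run(conn)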

Consider using soft durability mode. In soft durability mode RethinkDB will acknowledge the write immediately after receiving it, but before the write has been committed to disk. The server will use main memory to absorb the write, and will flush new data to disk in the background.

This mode is not as safe as the default hard durability mode. If you’re writing with soft durability, a few seconds’ worth of data might be lost in the event of a power failure.

You can insert data in soft durability mode as follows:

r.db("foo").table("bar").insert(document).run(durability="soft")

Note: while some data may be lost in case of power failure in soft durability mode, the RethinkDB database will not get corrupted.

Consider using noreply mode. In this mode, the client driver will not wait for the server to acknowledge a query before moving on to the next one. This mode is even less safe than soft durability, but can yield the largest performance improvement. You can run a command in noreply mode as follows:

r.db("foo").table("bar").insert(document).run(noreply=True)

You can also combine soft durability and noreply for the highest performance:

r.db("foo").table("bar").insert(document).run(durability="soft", noreply=True)

How can I order the output of group?

Commands chained after group operate on each group separately. If you want to operate on all the groups at once (e.g. to order them), you need to call ungroup before doing so.
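For example, to count posts per category and then sort the categories by those counts (a Python sketch; the posts table and category field are hypothetical):

# After ungroup, each group is a {'group': ..., 'reduction': ...} object.
r.table('posts').group('category').count().ungroup().order_by(
    r.desc('reduction')
).run(conn)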

What does ‘received invalid clustering header’ mean?

RethinkDB uses three ports to operate—the HTTP web UI port, the client drivers port, and the intracluster traffic port. You can point your browser at the web UI port to administer the cluster, and connect client drivers to the client driver port to run queries from your application. If you’re running a cluster, RethinkDB nodes communicate with each other via the intracluster traffic port.

The message received invalid clustering header means there is a port mismatch: something is connecting to the wrong port. For example, it’s common to get this message if you accidentally point your browser at, or connect a client driver to, the intracluster traffic port.
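With RethinkDB’s default ports, that means client drivers should connect to 28015, browsers to 8080, and only other RethinkDB nodes to 29015. A minimal Python sketch:

# Correct: connect the driver to the client driver port (28015 by default).
conn = r.connect(host='localhost', port=28015)

# Connecting a driver to the intracluster port (29015 by default) instead
# would trigger the 'received invalid clustering header' message.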

Does the web UI support my browser?

The following browsers are supported and known to work with the web UI:

  • Chrome 9 or higher
  • Firefox 15 or higher
  • Safari 6.02 or higher
  • Opera 1.62 or higher

The web UI requires DataView and Uint8Array JavaScript features to be supported by your browser.

Which versions of Node.js are supported?

The JavaScript driver currently works with Node.js versions 0.10.0 and above. You can check your Node.js version as follows:

node --version

You can upgrade Node.js via npm by installing the n version manager and then using it to install a newer release:

sudo npm install -g n
sudo n stable

If you’re trying to run the RethinkDB JavaScript driver on an older version of Node.js, you might get an error similar to this one:

/home/user/rethinkdb.js:13727
return buffer.slice(offset, end);
             ^
TypeError: Object #<ArrayBuffer> has no method 'slice'
at bufferSlice (/home/user/rethinkdb.js:13727:17)
at Socket.TcpConnection.rawSocket.once.handshake_callback (/home/user/rethinkdb.js:13552:26)

I get back a connection in my callback with the Node driver

Many people report that running a query gives them back a connection object, which looks like this:

{
    _conn: {
        host: 'localhost',
        port: 28015,
        db: undefined,
        authKey: '',
        timeout: 20,
        outstandingCallbacks: {},
        nextToken: 2,
        open: true,
        buffer: <Buffer 04 00 00 00 08 02 10 01>,
        _events: {},
        rawSocket: { ... }
    },
    _token: 1,
    _chunks: [],
    _endFlag: true,
    _contFlag: true,
    _cont: null,
    _cbQueue: []
}

This object is not a connection but a cursor. To retrieve the results, you can call next, each, or toArray on this object.

For example you can retrieve all the results and put them in an array with toArray:

r.table("test").run( conn, function(error, cursor) {
    cursor.toArray( function(error, results) {
        console.log(results) // results is an array of documents
    })
})

RethinkDB is running out of memory

You may need to adjust RethinkDB’s page cache size, using the --cache-size argument or configuration file option. Read “Understanding RethinkDB memory requirements” for a more detailed explanation of how RethinkDB uses memory and how to tune its performance.
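For example, to cap the page cache at two gigabytes (the value is given in megabytes):

rethinkdb --cache-size 2048

or, equivalently, in the configuration file:

cache-size=2048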

If you’re running RethinkDB on Linux and see a “Data from a process on this server has been placed into swap memory” warning in the System issues table even though your server has RAM available, it’s possible you need to adjust the swappiness kernel parameter. A swappiness setting of 0 prevents swap space from being used unless the server is completely out of physical memory; a setting of 100 uses swap space all the time. To check the swappiness of your kernel:

$ cat /proc/sys/vm/swappiness
60

A setting of 60 (the default for Ubuntu) means that your system will start using swap when RAM usage is at about 40%. If you’d like that threshold to be closer to 90%, set the swappiness to 10. You can do that by editing the /etc/sysctl.conf file (as root) and changing the setting there:

vm.swappiness = 10

This change won’t take effect until you reboot. You can also apply it while the system is running:

$ sysctl vm.swappiness=10
$ swapoff -a
$ swapon -a

I get incorrect results when I pass functions with if/for statements to ReQL

When you pass functions to ReQL, your language’s driver serializes those functions into ReQL lambda functions that are run on the server, not in your client language. (See All about lambda functions in RethinkDB queries for more details.) A consequence of this is that native language constructs like if and for will not produce the expected result when their conditions involve ReQL commands. While they may not cause errors, they will be executed on the client side before the function is compiled for ReQL, and thus give an incorrect result. Instead, you must use equivalent ReQL control functions such as branch and forEach. Here’s an example in Python from the Introduction to ReQL document:

# WRONG: Get all users older than 30 using the `if` statement
r.table('users').filter(lambda user:
    True if user['age'] > 30 else False
).run(conn)

# RIGHT: Get all users older than 30 using the `r.branch` command
r.table('users').filter(lambda user:
    r.branch(user['age'] > 30, True, False)
).run(conn)

And an equivalent example in JavaScript:

// WRONG: Get all users older than 30 using the ternary operator
r.table('users').filter(function(user) {
    return (user('age').gt(30) ? true : false);
}).run(conn, callback)

// RIGHT: Get all users older than 30 using the `r.branch` command
r.table('users').filter(function(user) {
    return r.branch(user('age').gt(30), true, false);
}).run(conn, callback)

(Note we must use gt instead of the native > operator in JavaScript, for the same reason. In Python the > operator is overloaded to be translated to ReQL’s gt command, a trick that is not possible in JavaScript.)

How do I specify an external canonical IP address of a RethinkDB node?

When a RethinkDB node starts, it will broadcast its “canonical” IP address, the address other nodes should use to connect to it. By default, the canonical address is the server’s primary IP address. However, if this address is an internal IP address that isn’t reachable by other nodes (for example, the nodes are on different networks), the nodes will not be able to reach one another. You may receive an error message such as:

error: received inconsistent routing information (wrong address) from xxx.xxx.xxx.xxx (expected_address = peer_address{ips=[xxx.xxx.xxx.xxx], port=29015}, other_address = peer_address{ips=[xxx.xxx.xxx.xxx], port=29015}), closing connection

To solve this, specify the canonical address explicitly by using the --canonical-address argument.

rethinkdb --canonical-address <external IP>

This may also be specified in the config file.
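For example (assuming the default Debian/Ubuntu configuration file location):

# /etc/rethinkdb/instances.d/instance1.conf
canonical-address=<external IP>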

My secondary index is outdated

When you upgrade RethinkDB versions, you may receive a warning message on startup about secondary indexes being “outdated”:

warn: Namespace <x> contains these outdated indexes which should be recreated:
<index names>

(This may happen, for instance, between v1.13 and v1.14, when the internal format of secondary indexes changed.) Outdated indexes can still be used—they don’t affect availability. However, you should rebuild your index before updating to the next version of RethinkDB.

You may rebuild indexes with the rethinkdb command line utility:

rethinkdb index-rebuild [-c HOST:PORT] [-r (DB|DB.TABLE)] [-n CONCURRENT_REBUILDS]

The -c and -r options are similar to other rethinkdb options, specifying the cluster host and port (defaulting to localhost:28015) and either a database or a table to rebuild. The -n option specifies the number of rebuilds that will be performed concurrently (defaulting to 1).

You may also rebuild indexes manually in ReQL:

  • Use index_status to retrieve a binary representation of the existing secondary index (whether it is simple, compound, multi, or based on an expression).
  • Create a new index using index_create.
  • Rename the new index to the old index’s name with index_rename.

A simple example in Python:

old_index = r.table('posts').index_status('old_index').nth(0)['function'].run(conn)
r.table('posts').index_create('new_index', old_index).run(conn)
r.table('posts').index_wait('new_index').run(conn)
r.table('posts').index_rename('new_index', 'old_index', overwrite=True).run(conn)

(The same example can be found in index_create for both Ruby and JavaScript.)

How do I store a Ruby DateTime object in RethinkDB?

The short answer: you can’t. Use Time objects instead.

The slightly longer answer: there’s only one native time data type in RethinkDB. When a language supports more than one kind of date/time object, we think it’s better to explicitly support one and only one of them in the client driver to avoid confusion. Otherwise, you might insert a DateTime object and get a Time object back.

You can use Ruby’s DateTime.to_time and Time.to_datetime methods to convert easily between the two.

Filters with or return incorrect/unexpected results

You might want to use filter to return documents that have one of two (or more) optional fields set, such as the following:

r.table('posts').filter(
    r.row('category').eq('article').or(r.row('genre').eq('mystery'))
).run(conn, callback);

However, if any document in the posts table above lacks a category field, it won’t be included in the result set even if it has a genre field whose value is 'mystery'. The problem isn’t the or command; it’s that the invocation of r.row('category') on a document without that field returns an error, and the rest of the filter predicate isn’t evaluated.

The solution is to add a default to the row command that always evaluates to something other than what you’re testing for, so it will return false if the field doesn’t exist:

r.table('posts').filter(
    r.row('category').default('foo').eq('article').
    or(r.row('genre').default('foo').eq('mystery'))
).run(conn, callback);

“Nesting depth limit exceeded” error

Typically, this indicates that a JSON object with subdocuments is too deeply nested:

{ "level": 1,
  "data": {
    "level": 2,
    "data": {
      "level": 3,
      "data": {
        "level": 4
      }
    }
  }
}

ReQL’s nesting depth is limited to 20 levels. This can be changed with the undocumented nestingDepth (or nesting_depth) option to r.expr(), but before using that, consider whether the document can be reorganized to avoid the error.
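For example, in the Python driver the option can be passed to r.expr (deeply_nested_doc is a hypothetical document more than 20 levels deep, and 40 is an illustrative limit):

r.table('test').insert(
    r.expr(deeply_nested_doc, nesting_depth=40)
).run(conn)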

It’s also possible for this error to be caused by a circular reference, where a document inadvertently contains itself:

var user1 = { id: 1, name: 'Bob' };
var user2 = { id: 2, name: 'Agatha' };
user1['friends'] = [ user1, user2 ];

Trying to access user1 in ReQL will cause a nesting depth error.

Depending on the driver, this error may also appear as “Maximum expression depth exceeded.”

“RqlTzinfo object is not JSON serializable” error

If you try to serialize a document containing a ReQL time zone object using Python’s json library, you may receive this error. Solve this by passing the time_format="raw" option to run:

import json
import datetime
from pytz import timezone  # supplies the timezone() constructor used below

today = r.expr(datetime.datetime.now(timezone('US/Pacific'))).run(conn,
    time_format="raw")
json.dumps(today)

'{"timezone": "-07:00", "$reql_type$": "TIME", "epoch_time": 1433368112.289}'

“Cannot use r.row in nested queries” error

The JavaScript and Python drivers support a convenience command, row(), which simply returns the currently selected document for use with other ReQL functions in the query. However, row won’t work within nested queries. The solution to this error is to rewrite the row clause as an anonymous function. So the following:

r.table('users').filter(
    r.row['name'] == r.table('prizes').get('winner')
).run(conn)

Can be rewritten with this function instead:

r.table('users').filter(
    lambda doc: doc['name'] == r.table('prizes').get('winner')
).run(conn)

Any query, nested or otherwise, can be written with an anonymous function instead of row. (The official Ruby and Java drivers don’t include row at all.)