All about lambda functions in RethinkDB queries

It’s no secret that ReQL, the RethinkDB query language, is modeled after functional languages like Lisp and Haskell. The functional paradigm is particularly well suited to the needs of a distributed database while being more easily embeddable as a DSL than SQL’s ad hoc syntax. Key to functional programming’s power and simplicity is the anonymous (aka lambda) function.

This post covers all aspects of lambda functions in ReQL from concept to implementation and is meant for third party driver developers and those interested in functional programming and programming language design. For a more practical guide to using ReQL please see our docs where you can find FAQs, screencasts, an API reference, as well as various tutorials and examples.

All examples make use of the official Python driver, but nothing in this post is Python-specific and all examples should translate directly to the other officially supported languages.

What’s the point of lambda functions in a database query language?

To grok ReQL, it helps to understand functional programming. Functional programming falls into the declarative paradigm in which the programmer aims to describe the value he wishes to compute rather than prescribe the steps necessary to compute this value. Database query languages typically aim for the declarative ideal since this style gives the query execution engine the most freedom to choose the optimal execution plan. But while SQL achieves this using special keywords and specific declarative syntax, ReQL is able to express arbitrarily complex operations through functional composition. To understand the contrast, consider the following SQL query.

SELECT * FROM users WHERE users.age >= 18

The body of the WHERE clause aims to describe the properties rows from the table must satisfy to be of interest to the rest of the query. The SQL engine might compile this query to some version of the following imperative algorithm.

results = []
for user in users:
    if user['age'] >= 18:
        results.append(user)

Those familiar with functional programming idioms might recognize this pattern as equivalent to a filter which abstracts away the for loop, if test, and result aggregation. Only the actual test, provided to filter as a boolean function on a single element, must be supplied by the user.

Python supports two different syntaxes for specifying code blocks. The older def fun(args...): ... statement and a newer lambda args...: ... expression. The key advantage of the latter syntax is that it is an expression that evaluates to a first class Python value that can be saved to a variable, passed to functions, and declared in-line. With this feature, the ReQL version of the original SQL query can be expressed succinctly with native Python syntax.

r.table('users').filter(lambda user: user['age'] >= 18)

This leads us to the first and most important answer to our question. Used correctly, lambda functions allow a complex algorithm to be easily decomposed into its constituent parts. This supports the goal of a declarative syntax by splitting out code unique to the computation at hand from the boilerplate best left to the query execution engine.

The main advantage of functional programming though comes when we we begin to compose functions to create more complex queries. Consider what happens when we modify the original SQL query to select just one field from each row.

SELECT team_id FROM users WHERE users.age >= 18

To do the equivalent in ReQL we transform the stream of rows produced by the filter operation with a map operation, perhaps the best known functional idiom, that “selects” the team_id field from each row to produce a stream of just these values.

r.table('users').filter(lambda user: user['age'] >= 18).map(lambda user: user['team_id'])

The original stream of values from the table is first fed into the filter and transformed into a stream of just rows that match the pattern. This result is then fed into the map which transforms it into a stream of just the relevant fields. This technique can be repeatedly applied to construct more and more complex queries. Let’s take it further and actually do something useful with those team id’s by getting the names of all teams with at least one adult member.

SELECT DISTINCT teams.team_name
FROM users, teams
WHERE users.age >= 18 AND
    users.team_id = teams.team_id

The following ReQL translation emphasizes the chain of operations this implies.

(r.table('users').filter(lambda user: user['age'] >= 18)
  .map(lambda user: user['team_id'])
  .map(lambda team_id: r.table('teams').get(team_id))
  .map(lambda team: team['team_name'])
  .distinct())

Notice how Python’s object oriented syntax supports composition through chaining over the nesting style of Lisp or C, avoiding the telescoping parens so closely associated with Lisp and functional programming generally. More canonical ReQL would collapse the sequential maps to leave a terser expression.

(r.table('users').filter(lambda user: user['age'] >= 18)
  .map(lambda user: r.tables('teams').get(user['team_id'])['name'])
  .distinct())

This query achieves the same result as the original SQL without being any more prescriptive about how to execute the query. Furthermore, it is expressed as 100% native Python using the built-in tools and syntax Python developers are already familiar with.

Variable binding

The simplest ReQL API function that takes a lambda function as an argument is do which immediately applies its function argument to its receiver. It can be thought of as a one element map. Let’s look at a simple example.

>>> r.expr("foo").do(lambda x: x + "bar").run()
'foobar'

This curious little tool actually serves a much wider purpose than it seems to at the surface for it is the way to bind variables and manage variable scopes in ReQL.

To illustrate how this is useful, let’s say you want to square a ReQL number. This requires referencing the value twice. You can recompute the value to make the second reference, i.e. (...some query...) * (...repeated...), but this is wasteful and wouldn’t work at all if you had to rely on a non-deterministic computation. Instead, we can compute the value once and bind it to a function argument by using do.

>>> r.expr(12).do(lambda x: x * x).run()
144

The lambda functions used in other constructs like map and reduce also create variable scopes. These can be nested with do to manage variables of different lifetimes. Let’s say that you want to transform a sequence of values into percentages of the whole by dividing through the sum. By using a do to bind the sum we can refer to it in each invocation of the mapping function without having to recompute it.

>>> r.expr([1,2,3,4]).do(lambda seq:
...     seq.reduce(lambda x,y: x + y).do(lambda total:
...         seq.map(lambda val: val / total)
...     )
... ).run()
[0.1, 0.2, 0.3, 0.4]

A similar technique is used to pass ReQL values to JavaScript terms. The ReQL JavaScript term passes a text string representing a JavaScript snippet to V8 as if you’d called eval within your JavaScript environment. In order for this feature to be useful there must be some way to pass values from the ReQL stack to the JavaScript environment. The solution, as you might have guessed, is to use lambda functions. Any JavaScript term that evaluates to a JavaScript function object can be used anywhere a ReQL lambda function is expected, including in do. Here’s how we might rewrite the original squaring example to use JavaScript.

>>> r.expr(12).do(r.js('(function(x) { return x * x; })')).run()
144

A more realistic example would actually leverage the additional value that JavaScript execution provides. Let’s say you want to find all users in your user table with Scottish sounding names (which we define as containing a ‘Mc’ or ‘Mac’ prefix). Though regular expressions are not currently available in ReQL natively we can access the functionality through ReQL’s JavaScript support. To pass the name string to the JavaScript snippet we will define a JavaScript function that accepts it as an argument. Since filter expects a lambda function anyway, we can simply pass the JavaScript term directly to the filter in place of an ordinary ReQL function.

>>> r.table('users').filter(r.js("""(function(user) {
...    return /^((Mac)|(Mc))[A-Z]/.test(user.last_name);
... })""").run()
[{'user_id': 1, 'first_name': 'Douglas', 'last_name': 'MacArthur'}, ...]

How lambda functions are processed by the drivers

Those who have made it this far may be by now wondering just how the pure Python code samples given above are actually transformed into something executable by the RethinkDB server. While this process may be of primary interest to third party driver developers, a peek under the hood will serve casual ReQL programmers as well.

Unlike SQL, ReQL is not transmitted to the database server as plain text. Instead, ReQL client drivers are responsible for implementing an embedded domain specific language and serializing queries to a binary wire format using Google’s excellent pan-lingual Protocol Buffer serialization format.

While there is no canonical text representation of the ReQL wire format we can use S-expressions to describe it’s structure. As an example, here’s how a simple mathematical query would be translated by the Python driver.

r.expr(2) * (2 % 3)
(mul 2 (mod 2 3))

Lambda functions passed to API functions like do and filter also get serialized in this way. Under the hood, do is translated to a term called “funcall”, and it’s lambda function argument to a term called “func”. A func term is comprised of two pieces, an array of argument names (given as integers) and a term giving the body of the function. The body term may then reference the arguments by constructing var terms with the appropriate argument names. This is the same squaring query from above translated the wire format.

r.expr(12).do(lambda x: x * x)
(funcall (func [1] (mul (var 1) (var 1))) 12)

Understanding just how the driver translates a Python lambda function into this serialized form requires a little bit of understanding of how it translates other bits of the embedded language.

Just as the symbols in your Python source code represent the values that will eventually be computed during runtime, the ReQL query objects constructed at runtime represent the values that will eventually be computed when the query is executed on the server. These objects are not merely references though, they actually encode the whole process of computing the value. Further operations on these “values” augment the computation represented and return an object representing the modified computation. The result is a tree where each node represents a step in the computation. In fact, this result looks just like the serialized S-expression format presented above in tree form. Serializing this on the wire is then just a simple depth first pass over the tree.

In principle, lambda function serialization is no different, though the process may be unintuitive at first glance. First, var terms are constructed for each of the function’s formal arguments. The actual Python lambda function is then invoked on these variable references and the return value is used as the body term of the ReQL lambda function. Remember, this value doesn’t just represent the value returned by the lambda function, it also encodes the whole process of computing that value, fully encapsulating the computation represented by the lambda function.

Let’s walk through a more comprehensive example to see how this works. Here’s how we might compute the average age of all the users in your user table.

r.table('users').map(lambda user: user['age']).reduce(lambda x,y: x + y) /
    r.table('users').count()

To construct the map node we must serialize the one argument lambda function that extracts the ‘age’ field from a single row in the table. The first step is to construct a variable reference to represent the “user” argument, the node (var 1). Next we invoke the actual lambda function on this node. The body of the function then calls the overloaded __getitem__ method on the var term. This constructs a node with the form (getattr (var 1) 'age') which becomes the body of our function. The map term encloses both the table reference and this function term as its arguments.

(map (table 'users') (func [1] (getattr (var 1) 'age')))

This result is then passed to reduce which is constructed in a similar manner though this time with two arguments.

(reduce (map ...) (func [2,3] (add (var 2) (var 3))))

The last step is to divide this result by the size of the table for the final query.

(div
    (reduce
        (map
            (table 'users')
            (func [1] (getattr (var 1) 'age'))
        )
        (func [2,3] (add (var 2) (var 3)))
    )
    (count (table 'users'))
)

Glossed over above is the algorithm for assigning integer variable names to each of the arguments. This is actually more difficult than you might think. Within the executing Python code the original names assigned by the programmer have been lost and the driver must come up with an alternative scheme for assigning names. While intuition suggests simply using a counter that numbers arguments one by one for each function, this approach is prone to a problem known as variable capture. Consider a simple query that makes use of nested scopes.

r.expr("foo").do(lambda x: r.expr("bar").do(lambda y: x))

The inner function is supposed to return the first argument of the outer function, “foo”, not the first argument of the inner function, “bar”. Can you spot the problem that occurs when we use a separate argument counter for each function?

(funcall (func [1] (funcall (func [1] (var 1)) "bar")) "foo")

We’ve created an ambiguity between the two arguments by assigning them the same name and now can’t know to which the variable is meant to refer. The simplest way to resolve the problem is to assign all arguments a unique name and the easiest way to do this is with a global counter.

The process of serializing lambda functions works exactly the same way in all the official drivers and we encourage third party driver developers to do something similar. Having ReQL lambda functions represented by native lambda functions in the host language vastly improves legibility of queries and reduces confusion. As you may have noticed, it can even be hard to tell that ReQL code is not native to the host language!