High scalability: SQL and computational complexity

Interested in working at RethinkDB? We’re hiring – please see our jobs page for more details.

Recently there has been a lot of discussion on fundamental scalability of traditional relational database systems. Many of the blog posts on this topic give a great overview of some of the immediate issues faced by engineers while scaling relational databases, but don’t dissect the problem in a systematic way and with sufficient depth to get to the core issues. I’d like to dedicate a series of blog posts to the problem of scalability and how it pertains to relational databases.

There are a number of aspects of an RDBMS that are relevant to high scalability:

  • Relational data model
  • Constraint enforcement (including referential integrity)
  • SQL and its operational semantics
  • ACID compliance

In addition, every aspect above must be discussed in the context of the following attributes:

  • A specific usage pattern
  • The implementation (a concrete system vs. a theoretically possible ideal)
  • Hardware platform (few expensive machines vs. many cheap machines)

In these series I will deal exclusively with OLTP (realtime) use cases. I will not discuss various forms of analytics, data mining problems, etc. For this post, let’s define realtime as O(k * log N), where k is some small constant that represents a well defined number of queries, and N is the total size of the data (in rows). In other words, all operations of a realtime database must completely evaluate in a logarithmic time relative to the full size of the dataset, and the number of such operations required to make a logically complete business transaction is small and independent of the size of the dataset. I chose this definition because it seems intuitively interesting – other functions that we see in practice are unlikely to satisfy realtime demands of real world systems. In future blog posts I’ll restrict this definition even further to account for constant factors, but for now reasoning in terms of complexity theory is sufficient.

There are two aspects of high scalability I’d like to cover – specific issues and problems relevant today, and what we can expect from data management systems in the future. In order to cover these aspects, I will focus both on concrete systems commonly used in production and on theoretical ideals. For completeness, I will also discuss most issues in terms of horizontal and vertical scaling. Every time I talk about an RDBMS, I will define these contexts explicitly.

Since I’m a big fan of theory of computation and programming language theory, I’ll kick off the series with a discussion on scalability of SQL from the perspective of theory of computation (here I use the acronym ‘SQL’ in its strictest sense and mean the actual query language as defined by the ANSI standard and implemented by most vendors).

From a purely theoretical, computational perspective, the ANSI SQL-92 is equivalent to a primitive recursive language. Most real world SQL implementations are even more expressive, and are Turing-complete. As far as vertical scalability is concerned, SQL is simply too expressive. Even if we restrict ourselves to SQL-92, it is possible to write queries of polynomial, or even exponential complexity – a far cry from a logarithmic requirement we established earlier. This means that according to our definition of real-time, SQL is fundamentally not a vertically scalable language.

What about horizontal scalability? Reasoning about it is more difficult because it involves somewhat esoteric computational classes, and requires additional assumptions. To simplify the reasoning, we make one key assumption. We assume that we can only have a polynomial number of machines (and cores) – this appears to hold true because the number of machines we can manufacture is dwarfed by the astronomical amounts of information we consume. If this holds true, even if we can trivially parallelize each query, an exponential function (the amount of information) divided by a polynomial (number of machines) still dominates the logarithmic function we defined earlier as acceptable. This means that given modern trends, if a given query isn’t scalable vertically, it also isn’t scalable horizontally, which makes SQL fundamentally unscalable, period.

Of course so far we’ve shown what we already know – that it is possible to write SQL queries that will likely never be fast enough to evaluate in practice. At first glace this doesn’t appear very useful – all we have to do is avoid writing such queries and use the subset of SQL that can be evaluated in logarithmic time. Unfortunately from a theoretical (and far too often practical) perspective doing this is impossible.

The culprit is SQL’s lack of operational semantics. Even a simple point query can (and often does) run in O(1) time for hash indexes, O(log N) for tree indexes, or O(N) for a linear scan. For more complicated queries, there are too many edge cases where the optimizer might magically switch from logarithmic to linear execution on a whim, despite having an index available. In practice, these changes result in expensive downtime, and hours of debugging and rearchitecting. For massively scalable realtime systems, this is SQL’s Achilles’ heel – you can’t use a subset of SQL that runs in logarithmic time – some of the time (in practice, far more often than you’d like), you’ll end up writing queries that don’t satisfy your requirements. If we do settle on a declarative query language (and I believe that anything else is a huge step backward) for massively scalable systems, it must have the property that any query you could express in this language is guaranteed to evaluate in logarithmic time.

Of course such a language has a significantly limited purpose. It cannot be used for most analytics problems, and more importantly for realtime systems, cannot be used for realtime problems which involve polynomial islands of data in the exponential universe. Facebook may some day have billions of users, but any given user is unlikely to have more than a thousand friends. In this scenario there are realtime subproblems where linear, and loglinear queries are perfectly acceptable, and our language can’t handle them. This means that for these subproblems, one must use a different system, which may or may not be an acceptable solution. Perhaps a better solution is to design a database system that only allows to run provably logarithmic queries for massive datasets, but relaxes the requirement for smaller subsets of data.

Unfortunately it would be extremely difficult to modify SQL to satisfy this behavior because it isn’t modular – almost all additions are special forms, and it is so far removed from any theoretical model (including relational algebra), that reasoning about it in a rigorous way is extremely difficult for both humans and compilers. My prediction is that systems of the future will use a modular, verifiable, higher order query language capable of enforcing various complexity requirements at compile time, and that it will not look very much like SQL.

Interested in working at RethinkDB? We’re hiring – please see our jobs page for more details.

10 Responses to “High scalability: SQL and computational complexity”

  1. The culprit is SQL’s lack of operational semantics.

    I don’t think this is quite right. It would be quite possible to define an operational semantics for SQL, although I’m not aware of any attempts to do so offhand: operational semantics is just a way to formally define the meaning of programs in terms of the operations of an abstract machine. The problem you’re raising is that SQL does not provide performance bounds/guarantees for queries — that is orthogonal to the presence of an operational semantics for the language.

    If we do settle on a declarative query language … for massively scalable systems, it must have the property that any query you could express in this language is guaranteed to evaluate in logarithmic time.

    You might be interested in PIQL, a recent query language for web applications from UC Berkeley that does just as you suggest. http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-8.html

  2. Neil: my mistake. It is, of course, possible to define performance bounds without defining operational semantics.

    PIQL looks very interesting, I will take a look.

  3. If I’m not mistaken, there’s a flaw in your reasoning about horizontal scalability. Having a polynomial (in what?) number of machines could reduce the time for a query by a polynomial amount. Your argument is that since there is an exponential (in what?) amount of data, the resultant running-time will still be non-scalable. However, as you clearly know, the running-time of a query is not proportional to the amount of data. If a query takes polynomial time on a single machine, it’s still not vertically scalable (according to your definition), but distributing it over a polynomial number of machines could reduce the running-time to make it “realtime”.

  4. Andrew: great point. I wanted to avoid getting into various computational classes and redefining the computational model with respect to multiple machines, so I used this simplification. Thanks for pointing out the flaw.

  5. It’s an interesting theoretical property of SQL that you highlight but it is one that SQL shares with most other languages on the standards level. Complexity guarantees are a feature of particular language/runtime implementations and often require the use of implementation specific hints. This is primarily a portability issue, not a scalability issue in my view.

  6. I may be confused, but isn’t this 180 degrees wrong? My understanding is that one of SQL’s greatest strengths in the performance game is the fact that it doesn’t have operational semantics that apply to the query alone, but also the data and environment.

    Because the plan to implement a query can depend on the data’s locality and stats, you can perform better queries than you could ever plan for in advance. The locality strikes me as a particularly important fact when thinking of distributed databases. The separation of what you want to do (the query) from how (the operations) is a good idea, and should be preserved in my opinion when re-thinking the future of databases. Do you think this conflicts with the desire for operational semantics or not?

  7. > Even if we restrict ourselves to SQL-92, it is possible to write queries of polynomial, or even exponential complexity – a far cry from a logarithmic requirement we established earlier. This means that according to our definition of real-time, SQL is fundamentally not a vertically scalable language.

    Come on, you’re grasping at straws here. What we care about is whether the restriction of SQL to “real-time queries only” is expressive enough to handle our tasks. Just because it’s possible to do a linear table scan doesn’t mean it’s wise to do it.

  8. JKF: I agree with Neil’s comment above – what’s needed isn’t an operational semantics specification, but a complexity bounds specification. This is what I should have said in the first place.

  9. Sorry, but I think you – and a lot of people out there – overestimate the power of any query language. QLs had been invented to make the life easier, not to do magical things the programmer never ask for. If you look at our graph query language (GQL – http://developers.sones.de/2010/02/24/get-a-taste-of-graph-databases-infogrid-neo4j-and-sones-graphdb/ ), which inherits a lot of ideas from SQL, you will see that it is much more expressive than doing the same operations via a normal programming language. That’s all… no more goodies.
    If you strip any QL down to operations that will always operate within O(log n) on the read path, you can take this speed from the write path and/or by increasing its memory requirements (index everything – even intermediate results – and so on… ). But in the end you will notice that this QL can no longer answer the really interesting questions as most real life problems are trivial ( == no QL needed) or not within O(log n).

  10. [...] RethinkDB – The database for solid state drives.In this scenario there are realtime subproblems where linear, and loglinear queries are perfectly [...]

Leave a Reply