Is an all-in-one database the future?

Why does an ideal generalist database remain elusive? Our new article explores the rise of purpose-built databases and the messy data infrastructure that results. We examine classification challenges, practical usage categories, and the limitations of naive parallel replication to understand why specialized databases remain crucial. Discover the complexities that keep the perfect all-in-one database out of reach.
Last updated: August 15, 2024

Two things are happening simultaneously in the database space. The first is the emergence of more and more purpose-built databases. As data problems become more defined and difficult, new databases spin out to optimize for those problems. The second, as a result, is a growing pain point of messy data infrastructure, strewn together by third-party data pipelines.

This prompts an interesting question: Will the latter problem of complex architecture push databases toward consolidation? In other words, will there be a generalist database that’s scalable, performant across contexts, and commercially appealing? To be frank, no. But let’s explore why.

The fundamental problem: classification

Before we can talk about consolidation, we first need to understand what exactly we are consolidating. Taking a step back, we encounter a problem that’s plagued database conversations for years: classification.

Classifying databases is difficult. There are multiple sensible ways to categorize them, each slicing and dicing the space differently.

One of the most common splits is relational (e.g., MySQL and ClickHouse) versus non-relational (e.g., MongoDB and Redis). A similar split is structured versus unstructured data. Both make sense in the classroom: they describe fundamentally different ways to organize data.

But splitting databases like this does little for us from a commercial perspective. Arguably, MySQL and MongoDB have more overlap from a usage standpoint (a primary data store for application data) than MySQL and ClickHouse do, the latter being optimized for analytics.

The seven groups

Instead, we can categorize databases by how they are practically used. This involves factoring in both their data structure and their physical constraints (memory location, scalability, etc.), which yields, somewhat crudely, seven groups.

  1. Relational OLTP databases. Think Postgres and MySQL. These are ideal for one-to-many and many-to-one relations but don’t play well with complex relationships, flexible schemas, or aggregations.

  2. Non-relational document databases. Document-storage databases like MongoDB and Elasticsearch have flexible schemas, a core benefit of this type of database, but may scale poorly.

  3. Relational OLAP databases. OLAP databases such as ClickHouse are relational like OLTP databases; however, the data is stored column-first (inverted) at the hard drive level. This makes them excellent for aggregations but poor for mutations or retrieving whole entries.
  4. Memory-based cache databases. Designed for speed, Redis and other memory-based cache databases store information in a computer’s memory instead of on the hard drive. They tend to use a key-value storage system to quickly write and retrieve data but fall flat on full-text search and have limited scale, given that memory is far more expensive than disk.
  5. Graph databases. These NoSQL databases, such as Neo4j, are predicated on nodes and relationships (a.k.a. edges) between the nodes. They are excellent for recommendation engines and complex data but are not efficient for data that falls neatly into a predefined schema.
  6. Vector databases. Vector databases like ApertureDB and Chroma are designed to store vectorized data (a.k.a. embeddings). This genre has exploded in popularity due to the advent of generally available LLMs (like GPT). Vector databases are optimized for vector search, in which data is retrieved based on likeness, determined by calculating the distance between two vectorized values (see the sketch after this list).
  7. Read-only data warehouses. Data warehouses are built for analytics. More specifically, they are built for massive amounts of data ingestion and wide-scale read operations as opposed to serving data to production applications like a traditional database does. However, given that data warehouses are tightly integrated with modern data stacks, they are worthy of a mention here.
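
To make the retrieval-by-distance idea from category 6 concrete, here is a minimal sketch in TypeScript. The document names and embedding values are invented, and a real vector database would replace this brute-force scan with an approximate index (e.g., HNSW); the point is only that "search" means "sort by distance."

    // Minimal nearest-neighbor retrieval by Euclidean distance.
    // Documents and embeddings are made up for illustration.
    type Doc = { id: string; embedding: number[] };

    const docs: Doc[] = [
      { id: "refund-policy", embedding: [0.12, 0.88, 0.45] },
      { id: "shipping-faq", embedding: [0.10, 0.80, 0.50] },
      { id: "api-reference", embedding: [0.90, 0.05, 0.20] },
    ];

    function euclidean(a: number[], b: number[]): number {
      return Math.sqrt(a.reduce((sum, ai, i) => sum + (ai - b[i]) ** 2, 0));
    }

    function nearest(query: number[], k: number): Doc[] {
      return [...docs]
        .sort((x, y) => euclidean(query, x.embedding) - euclidean(query, y.embedding))
        .slice(0, k);
    }

    // "Which stored documents are most similar to this query embedding?"
    console.log(nearest([0.11, 0.85, 0.48], 2).map((d) => d.id));
    // -> ["refund-policy", "shipping-faq"]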

Other niche categories can be distinguished by the same standard, but these seven define the major database groups that companies shop for today. Shopping usually ends with deploying multiple databases or data warehouses.

And that is where the problem lies.

The cobweb mess of modern data

While most companies have a primary data storage system, such as a Postgres or MongoDB instance, they rarely rely on it for every data problem. Incrementally, as customer-facing needs or internal problems become apparent, data teams build infrastructure around it to avoid “jamming” the main database. For example, a common query might be expensive on Postgres but efficient on TimescaleDB holding an exact replica of the same data, as sketched below.
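
As a rough sketch of that scenario (the connection strings, the events table, and its columns are all hypothetical), the same aggregation can be pointed at either system. On plain Postgres it walks every matching row; on a TimescaleDB hypertable holding a replica of the same data, time-based chunking and time_bucket tend to make it much cheaper.

    import { Client } from "pg";

    // Hypothetical connection strings; "events" is assumed to exist in both
    // the primary Postgres instance and the TimescaleDB replica.
    const primary = new Client({ connectionString: "postgres://primary/app" });
    const timescale = new Client({ connectionString: "postgres://timescale/analytics" });

    async function hourlyAverages() {
      await Promise.all([primary.connect(), timescale.connect()]);

      // On vanilla Postgres, this aggregation scans every row from the last week.
      await primary.query(`
        SELECT date_trunc('hour', created_at) AS hour, avg(duration_ms)
        FROM events
        WHERE created_at > now() - interval '7 days'
        GROUP BY 1 ORDER BY 1;
      `);

      // The same question against the TimescaleDB replica, where the hypertable
      // is partitioned into time chunks and time_bucket is purpose-built for this.
      await timescale.query(`
        SELECT time_bucket('1 hour', created_at) AS hour, avg(duration_ms)
        FROM events
        WHERE created_at > now() - interval '7 days'
        GROUP BY 1 ORDER BY 1;
      `);

      await Promise.all([primary.end(), timescale.end()]);
    }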

This is where pipelines come in. Data is dispatched to a primary data storage system but then piped to other databases for other queries. In other scenarios, data is directly ingested into each data storage unit and then consolidated.
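
Here is a toy version of the first pattern, with the primary store, the queue, and the analytics store all reduced to in-memory stand-ins (in practice the queue would be something like Kafka and the destinations real databases). The shape is what matters: writes land in one place and are replayed everywhere else, which is why replicas are only eventually consistent.

    // Toy fan-out pipeline; every "system" here is an in-memory stand-in.
    type Order = { id: number; amount: number };

    const primaryStore = new Map<number, Order>(); // stands in for Postgres
    const analyticsStore: Order[] = [];            // stands in for ClickHouse
    const queue: Order[] = [];                     // stands in for Kafka

    function writeOrder(order: Order) {
      primaryStore.set(order.id, order); // the application is served from here
      queue.push(order);                 // enqueued for downstream consumers
    }

    // A consumer drains the queue into the analytics store on its own schedule.
    function drainQueue() {
      while (queue.length > 0) {
        analyticsStore.push(queue.shift()!);
      }
    }

    writeOrder({ id: 1, amount: 42 });
    writeOrder({ id: 2, amount: 7 });
    drainQueue();
    console.log(analyticsStore.length); // 2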

With only two data systems involved, this may be manageable. Often, however, data infrastructure spirals to three or four database locations, especially when different data modalities such as images or videos need to be supported. PostHog’s stack, for example, includes ClickHouse for storing big data, Kafka for queuing events for ingestion, MinIO for storing files, Redis for caching and inter-service communication, and Postgres as the primary store for user data.

In short, we end up with architecture diagrams where a primary store sits at the center and arrows fan out to a handful of specialized systems.

The arrows that connect data stores aren’t trivial either. Whether it’s Kafka at PostHog, Fivetran at Zoom, or Airbyte at Monday.com, companies rely on event streaming and ELT products to move data from one location to another. These collectively become difficult to manage.

The hypothetical

Let’s start with a question where the answer is an obvious no: Can we build a database that combines all the benefits of the seven categories without compromising on efficiency and cost? No, of course not.

Why? Because databases aren’t just a bag of software features; how their data structures are laid out at the hardware level dramatically impacts performance on certain queries. For instance, ClickHouse is fantastic at aggregations because of its inverted, column-first storage, whereas row-based MySQL is not. MySQL could run on an imaginary computer that’s 100 times better and still be slower than ClickHouse for calculating an average across a column. It’s a matter of how much data must be read and how far the hard disk head needs to travel.
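
Some back-of-the-envelope arithmetic, with made-up numbers, makes the point. Averaging one numeric column over 100 million rows means reading vastly different amounts of data depending on the layout:

    // Invented figures, purely to illustrate layout costs.
    const rows = 100_000_000;
    const bytesPerRow = 200; // a full row in a row-oriented store
    const bytesPerValue = 8; // just the one numeric column being averaged

    // A row store must pull whole rows off disk to reach the column it needs.
    const rowStoreBytes = rows * bytesPerRow;      // ~20 GB scanned
    // A column store reads only the column being aggregated (before compression).
    const columnStoreBytes = rows * bytesPerValue; // ~0.8 GB scanned

    console.log(rowStoreBytes / columnStoreBytes); // 25x less data to read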

Let’s modify the question for a moment: Could we build a database system that combines all (or a strong majority) of the benefits of the seven categories from the user’s perspective, all while managing parallel instances under the hood to match efficiency?

Probably not. Why? Because the obvious approaches run into some pretty notable challenges.

The difficulties of (naive) parallel replication

A naive approach to building an all-in-one database would be for a database to store the same data in multiple data structures in parallel locations. That way, depending on the query, the database could harness the correct data structure.
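
Here is a toy version of that idea: a single facade keeping every record in both a row-oriented map (for lookups and mutations) and column-oriented arrays (for aggregations), routing each query to whichever copy fits. The names and shapes are invented; even at this scale, the dual-write and mutation problems discussed next are already visible.

    // Toy "all-in-one" facade: the same data kept in two physical layouts.
    type User = { id: number; age: number };

    class DualStore {
      // Row-oriented copy: good for point lookups and updates (OLTP-style).
      private rows = new Map<number, User>();
      // Column-oriented copy: good for aggregations (OLAP-style).
      private ids: number[] = [];
      private ages: number[] = [];

      insert(user: User) {
        // Every write must hit both layouts. Do we pay transactional cost on
        // the columnar copy, or let it lag behind? Neither answer is free.
        this.rows.set(user.id, user);
        this.ids.push(user.id);
        this.ages.push(user.age);
      }

      getById(id: number): User | undefined {
        return this.rows.get(id); // routed to the row layout
      }

      averageAge(): number {
        // Routed to the columnar layout: one tight scan over a single array.
        return this.ages.reduce((a, b) => a + b, 0) / this.ages.length;
      }

      updateAge(id: number, age: number) {
        const user = this.rows.get(id);
        if (!user) return;
        user.age = age;                        // cheap in the row layout...
        this.ages[this.ids.indexOf(id)] = age; // ...a positional rewrite in the column layout
      }
    }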

This quickly becomes fiction when we think through a specific example. Let’s take two database categories that already have a lot in common: OLTP databases and analytical OLAP databases. Both are relational databases with table-based schemas.

If our all-in-one database simply replicates data in two locations, it may initially seem plausible.

Then, some obvious hindrances will emerge:

  1. The database needs to relinquish ACID guarantees and adopt an event-queue system for ingesting events. This suits the OLAP half but not the OLTP half. Alternatively, events can be treated as transactions, but then they will be ingested more slowly.
  2. Scaling this database becomes trickier. OLTP and OLAP systems have different sharding requirements since their data scales in opposite directions.
  3. It remains unclear how to handle requests that OLAP databases cannot serve efficiently, such as mutations.

The lesson here is that having a single database store data in two parallel structures is too simplistic of a solution. There’s a reason that modern databases don’t support this—it breaks the way they scale and operate.

This alone, however, isn’t a death sentence for the all-in-one database. Let’s take a step back and think about another data abstraction layer: ORMs.

Lessons from the ORM

ORMs are software programs that promise a unified way to interact with databases by taking a declarative description of the data’s structure and provisioning a database accordingly. For example, an engineer could use Prisma, a popular ORM, to define the structure of a Postgres database (table names, field names and types, and any foreign key relations), spin up the database according to that structure, and then interact with it through CRUD operations via the ORM’s API. This contrasts with writing step-by-step imperative SQL commands.
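
A sketch of that workflow with Prisma, where the model and field names are invented: the schema is declared once, and CRUD then goes through the generated, typed client rather than hand-written SQL.

    // schema.prisma (the declarative description of the data):
    //
    //   model User {
    //     id    Int    @id @default(autoincrement())
    //     email String @unique
    //     posts Post[]
    //   }
    //
    //   model Post {
    //     id       Int    @id @default(autoincrement())
    //     title    String
    //     author   User   @relation(fields: [authorId], references: [id])
    //     authorId Int
    //   }

    import { PrismaClient } from "@prisma/client";

    const prisma = new PrismaClient();

    async function main() {
      // Create a user and a related post through the ORM's API, not raw SQL.
      const user = await prisma.user.create({
        data: { email: "ada@example.com", posts: { create: { title: "Hello" } } },
      });

      // Query users that have at least one post, with their posts included.
      const authors = await prisma.user.findMany({
        where: { posts: { some: {} } },
        include: { posts: true },
      });

      console.log(user.id, authors.length);
    }

    main().finally(() => prisma.$disconnect());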

But the ORM isn’t a silver bullet. Specifically, ORMs fall short when it comes to handling non-relational or schemaless databases. Given that non-relational databases have different data models, query languages, performance constraints, and mapped relationships compared to relational databases, it’s challenging for ORMs to handle them seamlessly. While some options like Prisma do work with document databases like MongoDB, support stops there: no vector databases, no graph databases, and so on.

The lesson here is that if ORMs can’t unify data management across all the distinct database types, then an all-in-one database, which would have to handle far more complexity than an ORM ever does, is even less likely. In other words, if the simplifying layer fails to simplify, the more complex layers underneath certainly can’t be simplified or consolidated.

Jack of all queries, master of none?

At the end of the day, a generalist database that tries to accommodate all data needs ends up accommodating none of them well. Why? There are a few issues:

  1. Optimization: Different databases are built for different use cases. Relational databases are great for relatively static schemas with clear relationships across tables, while document databases are optimized for schema flexibility. The same goes for time-series databases, which are optimized for sequential access to (no surprise) time-stamped data. How would one database handle it all?
  2. Data Model Overhead: A system that handles a number of distinct data models under one hood would balloon the data size. The memory and storage required for these data sets could be 50-100x what they would otherwise be.
  3. Latency: Adjacent to data model overhead is speed. An all-in-one database would need a large amount of resources (e.g., CPU and memory) to let the end user access any data at any time. Coupled with the larger data size, this would increase latency. Say goodbye to any use case where the business needs real-time data.

What’s the closest solution we have today?

Postgres. While it definitely isn’t a full one-stop-shop database, it is a massive project with numerous third-party plugins that can extend it to tackle niche functions.

For instance, pg_vector provides a vector search feature to Postgres by extending Postgres’s existing storage system.
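
For instance (the table, its columns, and the vector dimension below are hypothetical), a pgvector-backed similarity search is just SQL against an ordinary Postgres table:

    import { Client } from "pg";

    // Hypothetical setup, run once beforehand:
    //   CREATE EXTENSION vector;
    //   CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
    const client = new Client({ connectionString: "postgres://localhost/app" });

    async function similarItems(queryEmbedding: number[]) {
      await client.connect();
      // "<->" is pgvector's L2-distance operator; the nearest vectors come first.
      const { rows } = await client.query(
        "SELECT id FROM items ORDER BY embedding <-> $1::vector LIMIT 5",
        [JSON.stringify(queryEmbedding)]
      );
      await client.end();
      return rows;
    }

    similarItems([0.1, 0.2, 0.3]).then(console.log);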

Hydra, a startup, makes Postgres more tenable for analytical queries by adding OLAP-style data structures to it.

However, this is where the buzzword “purpose-built” comes in. Postgres isn’t purpose-built to be a jack-of-all-trades. It is a general-purpose database, and it does serve a wide variety of functions, but it isn’t intended to solve every problem efficiently. Proof of this is that companies that use Postgres, such as PostHog mentioned earlier, continue to have complex data stacks.

Closing thoughts

A “database” that orchestrates multiple parallel databases is not feasible. The technical challenges would make it difficult to build, and even if it were an option, it would be complex, less performant, and would introduce more headaches than it’s worth. Not to mention, perceptions of an all-in-one database would likely lag behind even its already subpar performance.

That being said, the modern data stack still has its issues. It’s a mess of point-solution SaaS products that each solve an individual data problem optimally. Companies today have determined that stringing them all together is the more functionally sound and commercially viable path. But that is also what remains difficult: the stack needs to be managed effectively for that to hold true.
