Relational vs. Non-Relational Databases
8 min read
Update: Consider reading my post on Database Scaling Strategies when you're done here. 😉
I’ve been focusing a lot more on backend development work during the last few months, and improving my knowledge around Python, FastAPI and databases. The vast majority of my experience so far, with backend databases, has been around relational databases. Recognising that there’s a host of other, newer database options out there, I took some time to take a deeper dive into the different types of databases and why we might choose one over the other.
The categories of databases I looked at were relational vs. non-relational databases.
- Also known as relational database management systems (RDBMS), relational databases can be thought of as data stored in spreadsheet-like tables.
- Data is stored in discrete tables, where columns are fields and rows are individual entries (also known as records).
- Data across tables can be linked through a foreign key field, which is a unique identifier for a record in a secondary table (its primary key).
- Structured Query Language (SQL) is normally used to run operations (read, create, update, delete) on the data. These types of relational databases are also known as SQL databases.
- Speed of data retrieval can be improved by using a database index.
- Examples of relational databases - Oracle, MySQL, Postgres.
- This category of databases typically covers any database that does not use a relational model. They are also known as NoSQL (Not Only SQL) databases.
- There are a range of NoSQL database types which store information in a variety of ways, with the most common ones being:
- Document datastore
- Key-value store
- Graph datastore
- Column-oriented database
Document data store
- Documents can be grouped together into collections (typically because they have similar content). Note however, that documents within a collection do not need to have identical data structures allowing for flexibility (no fixed schema).
- Example: MongoDB
- Use cases: Customer data management and personalisation, real-time analytics.
- A single attribute is used as a key, which maps to a value. Every key needs to be unique.
- The collection of key-value pairs are contained in an object.
- Querying and writing to the store is extremely fast due to its simplicity. Querying data happens at constant time.
- Complex data structures however, cannot be supported.
- Example: Redis, Memcached
- Use cases: Session store for session oriented applications where user login is required, data cache for rarely updated data, product recommendations.
- Data is stored as objects and edges (representing the relationships between objects).
- Flexible but can be difficult to query the entire database, especially if relationships between objects are not known or non-existent.
- Query language is specific to the database provider (no standardised querying language across graph databases).
- Example: Neo4j
- Use cases: social networks, recommendation engines.
- Similar concept to relational database, but with data organised into columns instead of rows.
- Different rows in the same table do not need to have the same columns. This allows for denormalised data storage (which decreases size and cost of data).
- Columns can be stored in separate servers allowing for horizontal scaling.
- Example: Cassandra
- Use cases: Large volumes of data where you can query data by columns.
Relational vs. Non-Relational Databases
Differences between relational and non-relational databases
- Relational databases have fixed schemas, whereas NoSQL databases can be schema-agnostic.
- SQL is typically used to run database operations on relational databases. As their name implies, this is not the case for NoSQL databases.
- Relational databases rely on vertical scaling (migrating to a larger server with more CPU, RAM or SSD), whereas NoSQL database can much more easily utilise horizontal scaling (through adding additional servers or nodes).
- Relational databases can also scale horizontally through sharding and partitioning logic, but it’s not easily done.
- Both types of databases can utilise replication (another horizontal scaling strategy) to increase availability by storing duplicates of up-to-date data.
- Relational databases enforces ACID properties whereas this is not necessary with NoSQL.
- CAP theorem is something that could be used to guide which NoSQL database is most appropriate for your use case (see more below).
Pros of relational databases
- Very popular and widely used.
- Well-suited and optimised for storing structured data.
- Better for guaranteeing data normalisation (limiting duplicate data and anomalies) due to strict schema requirements.
- ACID (Atomicity, Consistency, Isolation, and Durability) compliance - all changes must succeed for the transaction request to persist meaning better data consistency.
- Especially important for say, banking transactions.
- Atomicity: All transactions must succeed or fail completely and cannot be left partially complete, even in the case of system failure.
- Consistency: The database must follow rules that validate and prevent corruption at every step.
- Isolation: Concurrent transactions cannot affect each other.
- Durability: Transactions are final, and even system failure cannot “roll back” a complete transaction.
- Enforces data accuracy by referential integrity through the following rules:
- A foreign key must have a corresponding primary key (prevents “orphaned records”).
- If a record in a primary table is deleted, any record referencing that primary key must also be deleted.
- If a primary key changes, all other records referencing that primary key must be updated.
- Can be indexed to improve query response times.
Disadvantages of relational databases
- Scaling the database (due to increase in data size or increase in access frequency requiring more computing power) requires improving the database machine hardware (vertical scaling). There will eventually be a limit to this due to cost or hardware limitations.
- Fixed schema makes it difficult for structure of data to change down the line. You’ll therefore need to have a good idea of what your data will look like before setting up your databases. Any changes to the schema may require the database to go offline for maintenance.
- Increasing data size and complexity increases the time taken to query the database.
Pros of non-relational databases
- Flexible, can store unstructured and semi-structured data, and grow with evolving application data (no schema needed).
- More performant for large datasets. Can offer faster write speeds due to more flexible schemas.
- Can be especially useful in streaming / real-time applications where data is constantly being written.
- Can be scaled horizontally i.e. databases can be duplicated across multiple servers.
- Distributed databases increases fault tolerance.
- High availability means data can be served across multiple platforms.
- Downside is that these databases will lean towards eventual consistency.
When to choose relational vs. non-relational databases?
Generally speaking, I’d say that if your dataset is structured, and you don’t expect your dataset to grow to a point where vertical scaling will be prohibitive, go with a relational database. Relational databases are also the better choice if ACID transactions are a high priority, or if you have complex, but structured querying needs.
NoSQL databases are a better choice if you’re working with massive datasets that need to be available fast. They are also more useful if your data structure is not fixed, or if you need to optimise for writes rather than reads.
When choosing which NoSQL database to use, one thing to consider is the CAP theorem, which is about the tradeoff between Consistency, Availability and Partition Tolerance (see next section below).
Side note: there's also the emergence of NewSQL databases that tries to offer high write performance AND efficient querying. Techniques employed include distributed query processing and advanced indexing.
The acronym CAP refers to the following:
- Consistency. All clients receive the most recent write or an error, no matter which node they connect to. This requires any data that is written in one node, to be instantly replicated in all other remaining nodes in the system.
- Availability. All client requests get a response, even if one or more nodes are down (although the data may not necessarily be the most recent version).
- Partition tolerance. The system continues to operate despite network failures (e.g. dropped partitions, slow network connections, or unavailable network connections between nodes.) Note that a partition is a communications break within a distributed system (which could result in a lost / temporarily delayed connection between two nodes).
The CAP theorem (aka Brewer’s Theorem) states that a distributed system can only guarantee two out of these three desired characteristics. A distributed system is a network that stores data on more than one node. Note that all cloud applications are distributed systems and that partitions cannot be avoided in distributed systems.
Practically speaking, this means that partition tolerance will have to be one of the two guarantees, leaving a choice between a database that prioritises either consistency or availability. MongoDB for instance, is the better NoSQL database choice if you favour consistency, whereas Cassandra would be better for availability.