Designing Data-Intensive Applications: Data Models and Query Languages
This is part of a series of posts based on my learnings from “Designing Data-Intensive Applications” by Martin Kleppmann.
The “great debate”
The year was 1970. The most popular database for business data processing was IBM’s Information Management System (IMS), originally developed for stock-keeping in the Apollo space program. IMS used a simple ‘hierarchical model’: data is represented as a tree, with records nested within records, which cannot express ‘many-to-many’ relationships. Two solutions were proposed to this problem: the ‘relational model’ (which later became SQL) and the ‘network model’.
The network model was standardized by the Conference on Data Systems Languages (CODASYL). Unlike the hierarchical model, where each record has exactly one parent, in the network model a record can have multiple parents. The links between records were not foreign keys but something more like pointers; the ‘access path’ was to traverse from a root record along chains of links. Manual access-path selection made efficient use of the limited hardware of the time, but the code for querying and updating the database became complicated and inflexible.
The ‘relational model’, ahead of its time, took a different approach: its query optimizer automatically decides which parts of a query to execute in which order and which indexes to use. Query optimizers became much more efficient over time, and the key advantage of the relational model was that the optimizer had to be built only once; every application using the database benefits from it. Foreign keys are used to represent many-to-one and many-to-many relationships. Document databases, by contrast, reverted to something like the hierarchical model; they did not follow the path of CODASYL.
Below are a few dimensions along which we can compare the document and relational models.
Which model leads to simpler application code?
A deeply nested document model undermines its own strengths; likewise, shredding a document into multiple tables can lead to cumbersome schemas and unnecessarily complicated application code. Poor support for joins in document databases adds to this, which makes the relational model the better fit for highly interconnected data.
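To make the trade-off concrete, here is a minimal sketch of the same résumé-style data in both shapes: one self-contained nested document, and the same data shredded into normalized tables that must be joined to reassemble it. The names (`resume_doc`, the `users`/`positions` tables) are hypothetical, chosen only for illustration.

```python
import json
import sqlite3

# Document model: one self-contained record, nesting included.
resume_doc = {
    "user_id": 251,
    "name": "Alice",
    "positions": [
        {"title": "Engineer", "org": "Acme"},
        {"title": "Intern", "org": "Initech"},
    ],
}
doc_json = json.dumps(resume_doc)  # stored as a single JSON string

# Relational model: the same data shredded into two tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE positions (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(user_id),
        title TEXT,
        org TEXT
    );
""")
conn.execute("INSERT INTO users VALUES (251, 'Alice')")
conn.executemany(
    "INSERT INTO positions (user_id, title, org) VALUES (?, ?, ?)",
    [(251, "Engineer", "Acme"), (251, "Intern", "Initech")],
)

# Reassembling the "document" now requires a join.
rows = conn.execute(
    "SELECT u.name, p.title, p.org FROM users u "
    "JOIN positions p ON p.user_id = u.user_id"
).fetchall()
```

The document version reads back in one lookup; the relational version needs a join, but each position row can also be referenced independently from elsewhere in the schema.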
The document model, aka ‘schemaless’, is really schema-on-read, which gives it the power of a dynamically typed language: when the application has to change the format of its data, only a minor change in the application code is needed to handle both old and new documents as they are read. A schema change in the relational model has a heavier impact: running an UPDATE over a large table is likely to be slow and may require downtime, although some tools exist as a workaround to this limitation. Schema-on-read is better suited when, for some reason, the items in a collection do not all share the same structure.
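A minimal sketch of how schema-on-read absorbs a format change, assuming a hypothetical migration in which a single `name` field was split into `first_name` and `last_name` in newer documents. The application handles both formats at read time, so no table-wide UPDATE is ever run:

```python
def first_name(doc: dict) -> str:
    """Return the first name, tolerating old and new document formats."""
    if "first_name" in doc:
        # New format: the field already exists.
        return doc["first_name"]
    # Old format: derive the field on read instead of migrating storage.
    return doc.get("name", "").split(" ")[0]

old_doc = {"name": "Ada Lovelace"}                      # written before the change
new_doc = {"first_name": "Ada", "last_name": "Lovelace"}  # written after
```

Both documents coexist in the same collection indefinitely; the "schema" lives in the reading code rather than in the database.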
A document is usually stored as a single continuous string, encoded as JSON, XML, or a binary variant (such as MongoDB’s BSON). The entire document is loaded even if only a small part of it is accessed, so to benefit from this storage locality it is better to keep documents small and avoid writes that increase their size. Relational databases, on the other hand, have added support for locality by allowing the schema to declare how a table’s rows are grouped on disk: Oracle does this with ‘multi-table index cluster tables’, and Cassandra and HBase use the concept of ‘column families’.
With the relational model came the ‘declarative’ query language SQL, whereas IMS and CODASYL queried the database using ‘imperative’ code.
An imperative language tells the computer to perform certain operations in a certain order, like a programming loop that walks an array and prints matching elements. In a declarative language, you only specify the pattern of the data you want; the implementation details of the database engine are abstracted away, making it easier to add performance optimizations for queries under the hood. This is not so for imperative code. Say a database queried imperatively needs to reclaim unused disk space: it has to move records around without changing the order in which they appear, since it never knows whether some application code relies on that ordering. A declarative query makes no such ordering promise, which also makes it easier to parallelize execution.
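The contrast above can be sketched with a toy table (names and data are hypothetical). The declarative query states only the pattern wanted and leaves iteration order and index choice to the engine; the imperative version spells out the scan and the filter itself:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animals (name TEXT, family TEXT)")
conn.executemany(
    "INSERT INTO animals VALUES (?, ?)",
    [("shark", "Lamniformes"), ("lion", "Felidae"), ("tiger", "Felidae")],
)

# Declarative: describe the result, not the procedure.
rows = conn.execute(
    "SELECT name FROM animals WHERE family = 'Felidae'"
).fetchall()

# Imperative: fetch everything and spell out the loop and the filter.
cats = []
for name, family in conn.execute("SELECT name, family FROM animals"):
    if family == "Felidae":
        cats.append(name)
```

Because the declarative form promises nothing about how the rows are found, the engine is free to reorder, index, or parallelize the scan; the imperative loop has baked those decisions into application code.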
Graph data models and query languages
The relational model can handle simple cases of many-to-many relationships, but as the connections get more complex, it is better to move to a graph model. Facebook, for example, maintains a single graph with vertices representing people, locations, events, check-ins, and comments, and edges representing which people are friends and who commented on which post. CODASYL’s network model does look similar to the graph model, but it is in no way the same:
- CODASYL had a schema that specified which record type could be nested within which other; a graph database has no such restriction: any vertex can be connected to any other by an edge, which gives applications much more flexibility to adapt to change
- In a graph database you can refer to a vertex directly by its unique ID, unlike the network model, where the only way to reach a record was to traverse an ‘access path’
- The network model had an ordering of records that had to be maintained, which made inserting new records difficult; the graph model has no such order to maintain
- In CODASYL, querying was done in an imperative language, which made the code fragile whenever the schema changed; graph databases support declarative languages such as Cypher (used in Neo4j) or SPARQL
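The points above can be sketched with a tiny property graph in plain Python: vertices with properties, labeled edges between arbitrary vertices, direct lookup by ID, and no ordering to maintain. The data and names here are hypothetical, chosen to illustrate the structure rather than any particular graph database:

```python
# Vertices: looked up directly by unique ID, no access path required.
vertices = {
    1: {"type": "person", "name": "Lucy"},
    2: {"type": "city", "name": "London"},
    3: {"type": "country", "name": "United Kingdom"},
}

# Edges: (tail, label, head) triples; any vertex may link to any other,
# and the list carries no ordering the database must preserve.
edges = [
    (1, "born_in", 2),  # Lucy born_in London
    (2, "within", 3),   # London within United Kingdom
]

def containing_regions(vertex_id):
    """Follow 'within' edges transitively: every region enclosing a vertex."""
    result = []
    frontier = [head for tail, label, head in edges
                if tail == vertex_id and label == "within"]
    while frontier:
        region = frontier.pop()
        result.append(region)
        frontier += [head for tail, label, head in edges
                     if tail == region and label == "within"]
    return result
```

A declarative language like Cypher expresses this kind of variable-length traversal as a single pattern; the loop here only makes explicit what such a query asks the engine to do.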
All three models are in use today. One model can be emulated in terms of another, but the result is often awkward and not recommended. Each has its own purpose, and none is strictly better than the others.