A typical argument about in-memory database systems goes as follows:
This is indeed an interesting argument, and one that I’m not going to argue against. But it still feels like elastic caching and in-memory elastic databases will remain just one part of the software equation:
– Even if the price of RAM continues to decrease, the machines mentioned do not sound like commodity hardware, so you’ll have to balance the hardware cost against the value of the data.
– It still sounds like vertical scaling (nb: not saying that vertical scaling is always bad).
– There will always be data that fits better on disk (e.g. video).
– The more data you accumulate, the more you’d like to make sure that querying it (nb: online or offline) is not expensive.
These are just a few of the many examples of taking the argument to extremes. Before answering them, we should go over these lines from 2008.
Steven Robbins published an interesting article on InfoQ titled “RAM is the new disk”. In the comment thread, he quoted Tim Bray and others comparing file system performance to memory:
Memory is the new disk! With disk speeds growing very slowly and memory chip capacities growing exponentially, in-memory software architectures offer the prospect of orders-of-magnitude improvements in the performance of all kinds of data-intensive applications. Small (1U, 2U) rack-mounted servers with a terabyte or more of memory will be available soon, and will change how we think about the balance between memory and disk in server architectures….
This raises the following questions:
What if the disk were RAM-based? Does that mean that all we need to do is replace the current disks with RAM technology to gain speed? The title of the article leads people to think along those lines.
It’s not just the speed of memory compared to disk that makes the difference, nor even the extra benefit of collocating CPU and memory. What really matters is the fact that disk is a sequential storage medium, designed primarily to store a stream of bytes rather than tables of data. That means that if you want to store data objects, you need to serialize them into bytes and map sectors in the file system that point to the location of those bytes. Maintaining an index over this data is a relatively expensive operation, as every additional index is stored as a copy of the original data; there is no real option to access data by reference, and so on.

If you think about it, existing RDBMSs are basically a mapping layer between the data-table representation and sequential storage. A large part of existing database implementations is spent addressing the impedance mismatch between the two representation models. All this complexity simply doesn’t exist when we’re dealing with memory. That means that if we take existing databases and run them on memory-based devices, we’re basically going to force the limitations of sequential-storage representations onto memory.
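The contrast between the two storage models can be sketched in a few lines of Python (the `Order` class and field names are made up for illustration). The disk-oriented path has to flatten the object into bytes and parse them back, yielding a copy; the memory-oriented path simply keeps a reference to the live object:

```python
import pickle

class Order:
    def __init__(self, order_id, customer, total):
        self.order_id = order_id
        self.customer = customer
        self.total = total

order = Order(42, "alice", 99.90)

# Disk-oriented path: the object must be flattened into a byte
# stream before it can be written, and parsed back on every read.
blob = pickle.dumps(order)      # serialize: object -> bytes
restored = pickle.loads(blob)   # deserialize: bytes -> a new copy

# The round trip yields a *copy*, not the original object.
assert restored is not order
assert restored.total == order.total

# Memory-oriented path: no mapping layer at all; the "store"
# just holds a reference to the live object.
store = {order.order_id: order}
assert store[42] is order       # same object, accessed by reference
```

Every read and write on the disk path pays the serialize/deserialize cost; on the in-memory path a lookup is a pointer dereference.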
To exploit the real value of memory-based resources, we need a different approach – implementations that assume data can be accessed by reference, so that objects can be accessed directly from our application, in our native application domain, without a complex mapping layer.
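A minimal sketch of what access by reference buys you (the `User` class and index names are hypothetical): secondary indexes hold references to the same live objects, so an index costs one pointer per entry instead of a copy of the row, and an update made through any reference is immediately visible through every index.

```python
class User:
    def __init__(self, user_id, email, country):
        self.user_id = user_id
        self.email = email
        self.country = country

users = [
    User(1, "a@example.com", "US"),
    User(2, "b@example.com", "DE"),
]

# Primary "table" and a secondary index, both pointing at the
# same objects -- no serialization, no duplicated data.
by_id = {u.user_id: u for u in users}
by_email = {u.email: u for u in users}

assert by_id[1] is by_email["a@example.com"]

# An update through one reference is instantly visible through
# all indexes: there is only one copy of the object in memory.
by_id[1].country = "FR"
assert by_email["a@example.com"].country == "FR"
```

Contrast this with a disk-backed RDBMS, where each secondary index materializes another serialized copy of the keyed data and must be updated separately.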
At this point I’d like to end with Tim’s last remark:
Disk will become the new tape, and will be used in the same way, as a sequential storage medium (streaming from disk is reasonably fast) rather than as a random-access medium (very slow). Tons of opportunities there to develop new products that can offer 10x-100x performance improvements over the existing ones.
The second credit should be given to the paper that changed our point of view:
“The Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM”
Stanford University, Department of Computer Science
Or you can read:
“The End of an Architectural Era (It’s Time for a Complete Rewrite)”
Michael Stonebraker – MIT