Although all NoSQL databases promise horizontal scalability, they do not rise to the challenge equally. The Bigtable clones, HBase and Hypertable, stand at the front of the pack, while in-memory stores such as Membase and Redis, and document databases such as MongoDB and Couchbase Server, lag behind. The difference becomes more pronounced as the data grows very large, especially beyond a few petabytes.
Bigtable and its clones are designed to store both large individual data points and very large collections of data. The Bigtable model supports a large number of columns and an immensely large number of rows, and the data can be sparse, with many columns holding no value for a given row. The model does not waste space on such gaps; cells that have no value are simply not stored.
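To make the sparseness concrete, here is a minimal sketch using the HBase Java client (a 1.x-style API is assumed, and the "user_profiles" table and its column names are hypothetical). Only the cells that actually hold values are written, and therefore stored, so a row that populates two columns out of thousands of possible ones costs only two cells on disk.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SparseRowExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("user_profiles"))) {

            // Row "user#42" stores only the columns it actually has values for.
            Put put = new Put(Bytes.toBytes("user#42"));
            put.addColumn(Bytes.toBytes("attrs"), Bytes.toBytes("city"), Bytes.toBytes("Austin"));
            put.addColumn(Bytes.toBytes("attrs"), Bytes.toBytes("lang"), Bytes.toBytes("en"));
            // No cells are written for the many other possible columns,
            // so the sparseness costs nothing in storage.
            table.put(put);
        }
    }
}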
Google led the column-family-centric data store revolution in order to store the large and ever-growing web index its crawlers brought home. The Web has grown in an essentially unbounded way for years, and Google needed a store that could grow with the expanding index. Therefore, Bigtable and its clones were built to scale out, limited only by the hardware available for spinning up new nodes in the cluster. Over the past few years, Google has successfully used the Bigtable model to store and retrieve a variety of other data that is also very large in volume.
The HBase wiki lists a number of users on its Powered By page (http://wiki.apache.org/hadoop/Hbase/PoweredBy), and several of them clearly testify to HBase's capability to scale. Meetup (www.meetup.com) is a popular site that helps user groups and interest groups organize local events and meetings. Meetup has grown from a small, unknown site in 2001 to 8 million members in 100 countries, 65,000+ organizers, 80,000+ meetup groups, and 50,000 meetups each week (http://online.wsj.com/article/SB10001424052748704170404575624733792905708.html). Meetup is an HBase user: all group activity is written directly to HBase and indexed per member, and a member's custom feed is served directly from HBase.
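As a hedged illustration of that pattern (not Meetup's actual schema), a per-member feed can be modeled as one HBase row per member, with each activity stored as a cell under a reverse-timestamp column qualifier so that a single row read returns the newest entries first. The table, column family, and key formats below are assumptions.

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MemberFeed {
    private static final byte[] FEED = Bytes.toBytes("feed");

    // Fan-out on write: append an activity cell to each interested member's row.
    static void recordActivity(Table table, String memberId, String activityJson) throws Exception {
        long reverseTs = Long.MAX_VALUE - System.currentTimeMillis();
        Put put = new Put(Bytes.toBytes("member#" + memberId));
        put.addColumn(FEED, Bytes.toBytes(String.format("%019d", reverseTs)), Bytes.toBytes(activityJson));
        table.put(put);
    }

    // Serving the feed is a single-row read, which HBase answers quickly.
    static Result readFeed(Table table, String memberId) throws Exception {
        return table.get(new Get(Bytes.toBytes("member#" + memberId)));
    }
}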
Facebook is another big user of HBase; Facebook messaging is built on it. Facebook was the number one destination site on the Internet in 2010, has grown to more than 500 million active users (www.facebook.com/press/info.php?statistics), and is the largest software application in terms of the number of users it serves. Facebook messaging is a robust infrastructure that integrates chat, SMS, and e-mail, and hundreds of billions of messages are sent through it every month. The engineering team at Facebook shared a few notes on using HBase for their messaging infrastructure; read them online at www.facebook.com/notes/facebook-engineering/the-underlying-technology-of-messages/454991608919.
HBase has some inherent advantages when it comes to scaling systems. It supports automatic load balancing, failover, compression, and multiple shards per server. HBase works well with the Hadoop Distributed File System (HDFS), a massively scalable distributed filesystem. You know from earlier chapters that HDFS replicates and automatically rebalances data to easily accommodate large files that span multiple servers. Facebook chose HBase to leverage many of these features, and HBase is a necessity for handling the number of messages and users they serve. The Facebook engineering notes also mention that the messages in their infrastructure are short, volatile, and temporal, and are rarely accessed later. HBase, and Bigtable clones in general, are particularly suitable when ad hoc querying of data is not important. HBase does support querying data sets, but it is a weak replacement for an RDBMS as far as querying capabilities are concerned. Infrastructures like Google App Engine (GAE) do successfully expose a data modeling API, with more advanced querying capabilities, on top of Bigtable.
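The sketch below shows what querying typically looks like against HBase: a range scan over row keys plus a server-side filter (HBase 1.x-style client API assumed; the table, row-key format, and column names are hypothetical). There are no joins, no built-in secondary indexes, and no ad hoc SQL, which is why HBase is a weak replacement for an RDBMS on this front.

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanExample {
    // "Query": scan a row-key range for January orders, keeping only rows whose
    // "status" column equals "OPEN". Anything richer requires MapReduce or an
    // external indexing layer.
    static void scanOpenJanuaryOrders(Table table) throws Exception {
        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("order#2011-01-01"));
        scan.setStopRow(Bytes.toBytes("order#2011-02-01"));
        scan.setFilter(new SingleColumnValueFilter(
                Bytes.toBytes("info"), Bytes.toBytes("status"),
                CompareFilter.CompareOp.EQUAL, Bytes.toBytes("OPEN")));
        try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result row : scanner) {
                System.out.println(Bytes.toString(row.getRow()));
            }
        }
    }
}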
So it seems clear that column-family-centric NoSQL databases are a good choice when extreme scalability is a requirement. However, such databases may not be the best choice for all types of systems, especially those that involve real-time transaction processing. An RDBMS is often a better choice than any NoSQL flavor if transactional integrity is very important. Eventually consistent NoSQL options, like Cassandra or Riak, may be workable if weaker consistency is acceptable. Amazon has demonstrated that massively scalable e-commerce operations can be a use case for eventually consistent data stores, but examples beyond Amazon where such models apply well are hard to find. Databases like Cassandra follow the Amazon Dynamo paradigm and support eventual consistency; Cassandra promises incredibly fast read and write speeds and also supports Bigtable-like column-family-centric data modeling. Amazon Dynamo also inspired Riak, which offers a document store abstraction in addition to being an eventually consistent store. Both Cassandra and Riak scale well in horizontal clusters, but if scalability is of paramount importance, my vote goes to HBase or Hypertable over the eventually consistent stores. Where eventually consistent stores perhaps fare better than sorted, ordered column-family stores is where write throughput and write latency matter most. Therefore, if both horizontal scalability and high write throughput are required, consider Cassandra or Riak. Even in these cases, consider a hybrid approach in which you logically partition the write path from access and analytics and use a separate database for each task.
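To illustrate the consistency trade-off, here is a minimal sketch using the DataStax Java driver for Cassandra (a 4.x-style API is assumed, and the ticks.trades keyspace and table are hypothetical). Each statement chooses its own consistency level: a write at ONE favors throughput and latency, while a read at QUORUM trades some latency for stronger consistency.

import java.time.Instant;
import com.datastax.oss.driver.api.core.ConsistencyLevel;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class TunableConsistency {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Fast write path: acknowledge as soon as one replica has the data.
            SimpleStatement write = SimpleStatement
                    .newInstance("INSERT INTO ticks.trades (symbol, ts, price) VALUES (?, ?, ?)",
                            "ACME", Instant.now(), 101.5)
                    .setConsistencyLevel(ConsistencyLevel.ONE);
            session.execute(write);

            // Read path: require a quorum of replicas to agree before answering.
            SimpleStatement read = SimpleStatement
                    .newInstance("SELECT price FROM ticks.trades WHERE symbol = ?", "ACME")
                    .setConsistencyLevel(ConsistencyLevel.QUORUM);
            session.execute(read);
        }
    }
}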
If scalability implies large amounts of data arriving at an incredibly fast pace, for example stock market tick data or advertisement click-tracking data, then column-family stores alone may not provide a complete solution. It's prudent to store the massively growing data in these stores and manipulate it using MapReduce operations for batch querying and data mining, but you may need something more nimble for fast writes and real-time manipulation. Nothing is faster than manipulating data in memory, so NoSQL options that keep data in memory and flush it to disk as the available capacity fills up are probably good choices. Both MongoDB and Redis follow this strategy. Currently, MongoDB uses memory-mapped files (mmap) and Redis implements its own mapping from memory to disk, and both projects have been actively re-engineering their memory-mapping features, so things will continue to evolve. Using MongoDB or Redis together with HBase or Hypertable makes a good choice for a system that needs both fast real-time data manipulation and a store for extensive data mining. Memcached and Membase can be used in place of MongoDB or Redis; they act as a layer of fast and efficient cache and therefore complement column-family stores well. Membase has been used effectively with Hadoop-based systems for such use cases. With the merger of Membase and CouchDB, a well-integrated NoSQL product with both fast cache-centric features and distributed, scalable storage-centric features is likely to emerge.
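A minimal sketch of that hybrid approach follows, assuming Redis (via the Jedis client) as the fast real-time layer and HBase as the durable, minable layer. The class, table layout, and key formats are illustrative assumptions rather than a prescribed design.

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import redis.clients.jedis.Jedis;

public class ClickPipeline {
    private final Jedis redis = new Jedis("localhost", 6379); // real-time, in-memory layer
    private final Table clicks;                               // durable layer for batch mining

    public ClickPipeline(Table clicksTable) {
        this.clicks = clicksTable;
    }

    public void recordClick(String adId, String userId, long ts) throws Exception {
        // 1. Real-time path: bump an in-memory counter that dashboards can read instantly.
        redis.hincrBy("clicks:" + adId, "total", 1);

        // 2. Batch path: append the raw event to HBase, keyed for later time-range scans
        //    and MapReduce-style analysis.
        Put put = new Put(Bytes.toBytes(adId + "#" + ts + "#" + userId));
        put.addColumn(Bytes.toBytes("e"), Bytes.toBytes("user"), Bytes.toBytes(userId));
        clicks.put(put);
    }
}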
Although scalability matters enormously if your data grows to the size of Google's or Facebook's, not all applications become that large. Scalable systems are certainly relevant at scales much smaller than those, but an attempt to make everything scalable can become an exercise in over-engineering, and you want to avoid unnecessary complexity.
In many systems, data integrity and transactional consistency are more important than any other requirements. Is NoSQL an option for such systems?
Source of Information: NoSQL