Friday, January 24, 2014

Review of Manning Publications' latest book "Making Sense of NoSQL"


I recently finished reading this book and wanted to write down my opinion of it. I have mixed feelings about the book. It has several good points, but there are a few areas where I expected more and was a little disappointed. I will point out what I did not like as I go through the chapters.

As in most NoSQL books, the authors start with an overview of early NoSQL solutions and the reasons they came into existence. They give five case studies that explain how some companies came up with proprietary solutions to their big data problems. These solutions laid the foundation for the current NoSQL products.

The good part of this book is that the authors first introduce a concept and then present a case study around it to show readers how the concept is used in the real world. This is very important, because when a technology is new it is difficult to envisage how to use it in our own projects.

The authors clearly describe the difference between RDBMS and NoSQL. They are not biased towards NoSQL and give a practical opinion on the two approaches. They try to build the fundamentals from the ground up by covering the usage patterns, the terminology, examples, and key features of NoSQL solutions.
Almost every book tries to explain ACID, BASE, and CAP without showing any practical usage. In this book the authors devote a lot of pages to explaining these fundamentals and to showing how they make RDBMS different from NoSQL. They explain the CAP theorem (applicable only in the case of a network disconnection) very effectively, and I think this is the best description of the theorem compared to any other book on a similar topic.
A few chapters end with an “Apply your Knowledge” section, which presents a real-world situation and works out a solution based on the concepts built up in the chapter. This is a very good way of showing readers the utility of the concepts, but I think the section could have been much better. The arguments given for choosing among the options are usually weak, or the authors state the correct solution directly without discussing how to arrive at it. In short, readers will understand what the solution to the given problem should be, but the approach for reaching it may not be clear.

The authors highlight SQL joins as the reason RDBMS is not efficient in a cluster environment: the tables can be spread over multiple systems, which makes computing the join highly complex. They also suggest that transactions are the main reason RDBMS gained so much popularity, especially in e-commerce and banking. They dedicate a large chunk of chapter 3 to transactions and their importance. NoSQL does not have transactions built in, which is one of the reasons enterprises are a little skeptical about adopting it.
The authors suggest in many places that it is practical to combine RDBMS and NoSQL. One such use case they highlight is OLAP.

In chapter 4 the authors formally introduce the various NoSQL architectures on which most of the solutions are based. One piece of information unique to this book is the effect of the system environment, such as RAM and SSDs, on the performance and capability of NoSQL systems. Understanding how hardware variations can affect the choice is interesting, especially since most of these are cloud-based solutions and, as consumers, we can choose the configuration we want from the cloud vendor. You can define quality-of-service parameters such as maximum read time, maximum write time, and replication factor. The cost of the solution varies with these parameters, so we basically have a list of parameters that determine the cost of a NoSQL solution. There is a very good radio analogy that explains how the various parameters drive the cost.
However, I am a little disappointed with the content of chapter 4 and feel that more could have been added.
The authors could have given more examples of key-value store usage, such as maintaining shopping carts and user profiles. The highly available systems they point out, such as DNS and directory services, could have been covered as use cases for key-value stores.

The authors do not clearly explain why a query against a key-value store is fast, or how the query engine goes about finding the information. It is also not clearly explained how a key-value store differs from an RDBMS table with two columns, whether both are blobs or one is a string and the other a blob. When not to use a key-value store is not covered either.

In general I feel that the book does not have much hands-on detail. It is good from a theoretical point of view but not from a practical one, such as how to actually write a query in a popular NoSQL solution. It says that a graph database lets the user perform relational queries, but it does not show how to actually go about doing that.

I feel the introduction of the types of NoSQL databases could have been more structured and could have revolved around some common criteria: how well ACID properties are supported, how easy it is to query (somewhat covered), how well the distribution of data is managed, and some common examples of existing databases. That would give a better understanding of why we have different kinds of NoSQL architectures.
They try to explain column family databases, but I think a proper example showing how to create a key would have been very helpful.
They give three examples of column family databases and try to explain how the solution helps users, but along with this they should have given a small sketch of the entries that are actually written to such databases. I could not understand why it is easier to retrieve data from a column family database than from other solutions.
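
To show the kind of sketch I was hoping for, here is a minimal one of my own. It is not taken from the book and does not use the API of any real product. It assumes the commonly described layout in which each value is addressed by a row id, a column family, a column name, and a timestamp; the table contents and the `read_row` helper are made up for illustration.

```python
# Hypothetical column family table: every entry is one cell.
# key = (row_id, column_family, column_name, timestamp) -> value
table = {
    ("user#42", "profile", "name", 1): "Asha",
    ("user#42", "profile", "city", 1): "Pune",
    ("user#42", "profile", "city", 2): "Delhi",   # newer version of the same cell
    ("user#42", "orders", "2014-01-20", 1): "book",
    ("user#77", "profile", "name", 1): "Ravi",
}

def read_row(row_id, family):
    """Return the latest value of every column in one family of one row."""
    latest = {}
    # Sorting the keys puts cells of the same row and family next to each
    # other, with older timestamps before newer ones.
    for (row, fam, col, ts), value in sorted(table.items()):
        if row == row_id and fam == family:
            latest[col] = value  # a later timestamp overwrites an earlier one
    return latest

print(read_row("user#42", "profile"))
```

As I understand it, real column family stores keep the keys sorted, so all the cells of one row and family sit together and can be read in one pass. The scan over a Python dictionary above only imitates that.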

Document-oriented databases get less coverage. As a reader, I do not want to refer to multiple books to understand each of these four families. This book should have given me a clear picture of all four types.

In chapter 6 the authors introduce the various NoSQL database solutions and explain why organizations are moving towards them. They give a list of use cases where NoSQL can help, such as event logs, remote sensor data, and trends from social media data.

The authors highlight the fact that NoSQL databases generally provide an easy-to-use interface for performing really complex tasks. This is one of the reasons NoSQL projects can be adopted without a steep learning curve.
Big data problems have different scaling needs depending on their domain. Linear scalability in particular is one of the biggest problems with RDBMS solutions and is an inherent feature of almost all NoSQL offerings. With so many offerings available in the market, a solution might be chosen based on query expressivity and degree of scalability. For example, key-value stores are the most scalable but the least expressive, as you can only query on the keys. Document stores are the most expressive, as you can query on any field of the stored document.
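
The difference in expressivity can be sketched in a few lines. This is my own illustration using plain in-memory Python structures, not the interface of any actual key-value or document database; the data and the `kv_get` and `find` helpers are hypothetical.

```python
# Key-value store: the value is an opaque blob, so the only question you can
# ask is "what is stored under this key?"
kv_store = {
    "cart:1001": b'{"user": "asha", "items": 3}',
    "cart:1002": b'{"user": "ravi", "items": 1}',
}

def kv_get(key):
    """Look up a value by its key; returns None if the key is absent."""
    return kv_store.get(key)

# Document store: the store understands the fields of each record, so a
# query can be written against any of them.
documents = [
    {"_id": 1001, "user": "asha", "items": 3},
    {"_id": 1002, "user": "ravi", "items": 1},
]

def find(field, value):
    """Return every document whose field equals value (a plain scan here)."""
    return [doc for doc in documents if doc.get(field) == value]

print(kv_get("cart:1001"))
print(find("user", "ravi"))
```

With the key-value store, finding all carts that belong to "ravi" would mean fetching and decoding every value in the application, which is exactly the expressivity the store gives up.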

The authors try to cover a wide spectrum of use cases and the applicability of various NoSQL solutions. They point out a different type of NoSQL solution for each category of use case.
I think one of the weakest parts of this book is the editing. Some of the information given in chapter 6 should have been in chapter 4, because it is vital for understanding the differences between the various NoSQL architectures, yet you have to wait for later chapters to get those details. One diagram that points out a lot of important aspects of the various types of NoSQL should have been in chapter 4 rather than chapter 6.

This book is different from many other books on the subject, which only describe the various types of databases, compare them against each other, cover basic concepts like ACID and BASE, and then end the discussion. This book tries to bring in the usage aspect: how we can use NoSQL in conjunction with other technologies to add value to the business, for example how to improve the ETL process, how to make documents stored in multiple locations searchable, and how to use map-reduce algorithms with Hadoop for batch processing of large data sets.

I find the resource-sharing concepts, such as shared RAM, shared disk, and shared-nothing, quite interesting; other books have not covered them in such detail. The authors try to give the practical implications of each physical architecture for database performance. They even say that knowing the hardware options available for big data is an important first step in choosing a NoSQL solution.

Then they discuss the distributed processing models: master-slave and peer-to-peer. Peer-to-peer is more complex but less prone to failure, and hence provides high availability. Master-slave is less complex but has a single point of failure.

The authors give a high-level overview of the map-reduce paradigm. They highlight that one of the main concerns in a map-reduce program is the uniform distribution of tasks across all the nodes in the cluster. If the load is not uniform, performance takes a hit because too much of the work is done by a single node. They also point out that one of the selection criteria for a NoSQL solution is how well it integrates with Hadoop.
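
The load problem is easy to reproduce in a small sketch. This is my own example, not code from the book or from Hadoop: records are assigned to nodes by hashing their key, and the key names, the node count of 4, and the `node_for` and `load_per_node` helpers are all made up.

```python
import zlib
from collections import Counter

def node_for(key, num_nodes):
    """Pick a node by hashing the key (crc32 keeps the result repeatable)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes

def load_per_node(keys, num_nodes):
    """Count how many records each node would have to process."""
    return Counter(node_for(key, num_nodes) for key in keys)

# 1000 records with varied keys versus 1000 records dominated by one hot key.
varied_keys = [f"user-{i}" for i in range(1000)]
skewed_keys = ["popular-user"] * 900 + [f"user-{i}" for i in range(100)]

varied = load_per_node(varied_keys, 4)
skewed = load_per_node(skewed_keys, 4)
print("busiest node, varied keys:", max(varied.values()))
print("busiest node, one hot key:", max(skewed.values()))
```

Every record with the same key lands on the same node, so in the skewed case one node gets at least 900 of the 1000 records no matter how many nodes are added. The job can only finish as fast as that busiest node.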

The authors discuss the ways data is handled, such as distributed queries, data distribution over nodes, and replication, so that the time to answer a query can be reduced when dealing with large amounts of data.
Basically, they are saying that data distribution together with distributed processing is the key to NoSQL's success.
They introduce Apache Flume, a popular way to collect event logs so that they can be analyzed with the help of Hadoop and HDFS.

They give a case study showing a practical use of NoSQL solutions for analyzing distributed event logs in an enterprise. The introduction is good and leaves you eager to know how it is done, but you do not get answers about many of the actors, such as the fast channel. For example, how do you filter the critical events from all the incoming events? What is this fast channel? Is it a notification server or some kind of RDBMS?

With respect to graph databases, the authors suggest that they can be used in the healthcare industry to detect fraud. They give an overview of the problem domain but do not explain the nuts and bolts of how to create a large shared RAM that can be extended in the future.
Some questions remain unanswered: Is there any size limit on building the shared memory? How reliable is the shared RAM? Can the failure of one individual RAM chip bring down the entire infrastructure?

In chapter 7 the authors touch on a very relevant and important topic: how to fetch data from NoSQL in a timely manner with high recall and precision.

They point out that NoSQL databases usually integrate a well-established search library such as Apache Lucene or Apache Solr to provide reliable full-text functionality.

The authors explain the key terms used when building search functionality, such as stemming, indexing, proximity search, scores, boosting, rank, and the storage strategy for the indexes and the actual data.

The authors point out the use of map-reduce in creating the reverse index. In this chapter, too, they try to show the practical side of using the technology to solve real-life problems.
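
For readers who have not seen a reverse index (also called an inverted index) built this way, here is a minimal single-process sketch of my own. It only imitates the map and reduce steps in plain Python; the two sample documents and the `map_phase` and `reduce_phase` names are made up, and a real job would run on a framework such as Hadoop.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Emit a (word, doc_id) pair for every distinct word in one document."""
    return [(word, doc_id) for word in set(text.lower().split())]

def reduce_phase(pairs):
    """Group the emitted pairs by word to form the index."""
    index = defaultdict(set)
    for word, doc_id in pairs:
        index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

# Made-up documents keyed by id.
docs = {
    1: "NoSQL stores scale out",
    2: "Graph stores model relationships",
}

# Each document can be mapped independently, which is what lets the map
# step run in parallel on different nodes.
pairs = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
index = reduce_phase(pairs)
print(index["stores"])  # [1, 2]
```

A search for a word then becomes a single lookup in `index` instead of a scan over every document.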

The authors present several case studies to explain where search functionality is used and what needs to be done to build that capability. They give an example of searching technical documentation, in which they explain the concept of boosting in detail, and an example of building a search engine for finding the correct chart in a financial enterprise using an XML-based database and Lucene. Finally, they describe a problem common to every software organization: you have a lot of project documents (SDLC documents) in various formats such as docx, pdf, and jpeg, but no easy way to find information in them. They suggest how NoSQL can be of use in this scenario.

In chapter 8 the authors give a good introduction to high availability, using an e-commerce website as the example. They point out the business impact of the database going down while a customer is online. They explain the jargon associated with high availability, such as failure metrics, automatic failover, client yield, harvest metric, load balancing, clusters, and replication.
The authors emphasize the design pattern of moving the query to the data rather than the data to the query, which saves the time and network bandwidth needed to transfer large chunks of data from the storage node to the processing node. The query can also be distributed so that it uses the processing power of the various nodes (shared-nothing architecture).
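
The pattern can be illustrated with a small sketch of my own, in which Python lists stand in for the shards held by three nodes. The order data and all function names are hypothetical and are not taken from the book or from any product.

```python
# Each inner list is the slice of orders owned by one node (made-up data).
shards = [
    [{"region": "EU", "total": 40}, {"region": "US", "total": 25}],
    [{"region": "EU", "total": 10}],
    [{"region": "US", "total": 5}, {"region": "EU", "total": 30}],
]

def local_sum(shard, region):
    """Runs where the shard lives and returns a single small number."""
    return sum(row["total"] for row in shard if row["region"] == region)

def query_to_data(region):
    """Send the query to every shard and combine the partial results."""
    return sum(local_sum(shard, region) for shard in shards)

def data_to_query(region):
    """Contrast: copy every row to one place first, then filter there."""
    all_rows = [row for shard in shards for row in shard]
    return sum(row["total"] for row in all_rows if row["region"] == region)

print(query_to_data("EU"), data_to_query("EU"))
```

Both functions give the same answer, but the first moves only one number per shard over the network, while the second moves every row.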

The authors present case studies of three popular NoSQL solutions known for their high availability features. They point out that Amazon DynamoDB gives the user the flexibility to choose the read and write throughput and the type of read consistency, and to scale up or down to support elastic demand. However, this case study did not give me a clear picture of how high availability is maintained in DynamoDB. What I understood was how flexible the option is.

The authors then present case studies on Cassandra and Couchbase and how they meet high availability expectations. The discussion is brief and is approached from the reliability and high availability perspective.

I did not like chapter 9, which is on agility, because my only takeaway was that NoSQL databases are schema-less and so it is easy to change the fields, which I think was already covered in the introductory chapter. The case study did not help me understand the agility part, and I doubt whether the arguments are correct. I think this chapter could be dropped, or its main points could be included in some earlier chapters.

Chapter 10 is a heavyweight chapter and should be read with a relaxed mind. However, I do not understand the utility of this chapter in a NoSQL book. It covers the fundamentals of functional programming, and hence of map-reduce. The chapter provides a very high-level overview of functional programming, but since the topic is complex, a high-level description does not serve much purpose. I think the section where the authors compare imperative and functional programming with a diagram is all they needed to provide, followed by map-reduce. It could therefore have been moved to chapter 6, where they introduce map-reduce.

The “Apply your knowledge” section does not connect the dots. How is the content related to NoSQL, and why could it not have been in a functional programming book? The whole argument about caching and its relation to functional programming also does not make sense to me. But again, this is my view.

The only good part is the introduction to Erlang, as it gives an idea of why the language is becoming so popular.

Chapter 11 has nice details on the implicit requirement for security features in NoSQL databases. It is nicely written and an easy read. The authors first help us understand the fundamentals of building security into a database solution, which boil down to four key aspects: authentication, authorization, audit, and encryption. You want to make sure that only the right people have access to the appropriate data in your database. You also want to track their access and transmit data securely in and out of the database.
The chapter also points out that RDBMSs are better off than NoSQL here, since they have quite mature security systems, whereas security in NoSQL is still not mature. The authors discuss how security can be shared between the application level and the database level. They then try to give a map of the various techniques and their relative advantages over each other.

The authors use a few case studies to show how popular services implement security and at what granularity. Amazon S3 provides bucket-level security, while in Apache Accumulo authorization is applied to each key-value pair, so a user can be denied access to particular keys.
In the last chapter, the authors point out that deciding on the right database solution for a project is not a simple exercise. Unfamiliarity with the NoSQL paradigm might be the greatest hurdle to the project team accepting it as a database option. The team might be biased towards a particular RDBMS technology that they have been using for decades.

The authors present an architectural trade-off analysis for objectively selecting the database that is the best fit for a business problem. The idea is basically to list and prioritize the business requirements and then to score the effort required to implement them with a particular NoSQL solution. There is a long list of dos and don'ts for creating an architectural team, followed by guidelines on how to go about the trade-off analysis.

The authors suggest that, apart from comparing database solutions on technical capabilities such as indexes, query support, and web access, they should also be compared on quality parameters such as scalability, availability, portability, searchability, and agility.

The authors suggest preferring solutions that can be deployed on the cloud, as they reduce infrastructure cost while also taking care of scalability.

The authors give a nice diagram (the quality tree) that shows how the different quality parameters can be broken down into specific features, for example searchability into full-text search, XML search, and custom scoring, and how each of the features is prioritized for the organization. The quality tree can help bring the stakeholders onto the same page, especially when the audience is mixed and not everyone is conversant with the technical jargon.
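
One simple way to turn such priorities into a comparison is a weighted score. The sketch below is my own simplification and not the authors' exact method; the priorities, the candidate names, and the 1 to 5 scores are invented for illustration.

```python
# Hypothetical priorities agreed with the stakeholders (higher = more important).
priorities = {"scalability": 3, "availability": 3, "searchability": 2, "agility": 1}

# Hypothetical scores from 1 (poor) to 5 (excellent) for two candidates.
scores = {
    "candidate_a": {"scalability": 4, "availability": 3, "searchability": 2, "agility": 5},
    "candidate_b": {"scalability": 3, "availability": 4, "searchability": 5, "agility": 2},
}

def weighted_total(candidate):
    """Multiply each quality score by its priority and add the results."""
    return sum(priorities[q] * scores[candidate][q] for q in priorities)

for name in scores:
    print(name, weighted_total(name))

best = max(scores, key=weighted_total)
print("best fit under these assumptions:", best)
```

The numbers matter less than the conversation they force: the stakeholders have to agree on the priorities before any product is scored.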

I think this chapter is meant for project managers or architects who are trying to get the nod of management and the other stakeholders. Techniques like the quality tree and the architectural trade-off analysis will help them prepare a strong case and drive the point across the table.

To summarize:

One problem I have seen consistently in this book is that the introduction to each chapter lists the topics to be covered, but the coverage does not match. A few of the topics are not covered at all or are covered only superficially. I think that might get resolved in future editions of the book. Even the titles of the case studies do not always match their content. Sometimes the case studies are more about a product than about the concept being discussed in the chapter.

The authors generally introduce each topic in a very simple manner, with examples and pictures. Case studies can be a very good way to show the applicability of a topic in real-life systems, so that readers know how to apply the concepts. Unfortunately, even though each chapter has a lot of case studies, almost all of them have missing information or digressions that do not help the authors summarize the topic effectively. They are good to read independently, but when you try to see how they apply to the topic under discussion, there are gaps. Hopefully these gaps will be filled in upcoming editions.

Even though there are problems with the content, there are advantages too. The biggest advantage of this book is that it covers topics such as full-text search, high availability, and map-reduce with NoSQL databases, which other books generally do not cover or cover only in their further reading sections. So it is a good book for an introduction to the complexity and the opportunities of NoSQL databases. The case studies show how complex scenarios that were very difficult to implement a few years ago can be resolved using NoSQL solutions.


My recommendation is to read this book to get a wide perspective on the subject, not to become a master of it.