I recently finished reading this book and thought I would write up my opinion of it. I have mixed feelings about the book. From my viewpoint it has several good points, while there are a few areas where I was expecting more and was a little disappointed. I will point out what I did not like as I give my opinion on each chapter.
Like most NoSQL books, this one starts by giving
an overview of early NoSQL solutions and the reasons why they came into
existence. The authors give five case studies to explain how some companies
came up with proprietary solutions to their big data problems.
These solutions laid the foundation for the current NoSQL products.
The good part of this book is that the authors first
introduce a concept and then present a case study around it to
show readers how the concept is used in the real world. This is very
important, since when a technology is new it is difficult to envisage how
to use it in our own projects.
The authors have very clearly described the differences
between RDBMS and NoSQL. They are not biased towards NoSQL and give a
practical opinion on the two approaches. They have tried to build the fundamentals
from the ground up by giving the usage patterns, terminology, examples, and key
features of the NoSQL solutions.
Almost every book tries to explain ACID, BASE, and CAP without
giving any practical usage. In this book, however, the authors have devoted a lot of pages
to explaining the fundamentals and to showing how they make RDBMS different from
NoSQL. The authors have explained the CAP theorem (which applies only in the case of a
network disconnection) very effectively, and I think this is the best
description of the theorem among books on similar topics.
A few chapters end with an “Apply your knowledge” section,
which presents a real-world situation and works towards a solution
based on the concepts built up in the chapter. This is a very good way of showing
the audience the utility of the concepts. But I think this section could
have been much better. Usually the arguments given for choosing among the options are
weak, or the authors directly state which solution is correct
without discussing how to arrive at that approach. In short, the
audience will understand what the solution to the given problem should be, but
the reasoning that leads there might not be clear.
The authors highlight SQL joins as the reason why RDBMS
is not efficient in a cluster environment: the tables can be spread over multiple
systems, and computing the join then becomes highly complex. They also suggest
that transactions are the main reason why RDBMS gained so much popularity,
especially in e-commerce and banking. They have dedicated a large chunk of
chapter 3 to transactions and their importance. NoSQL solutions generally do not have
transactions built in, and this is one of the reasons why enterprises are a little
skeptical about adopting them.
The authors suggest in many places that it is practical
to combine RDBMS and NoSQL. One such use case they highlight
is OLAP.
In chapter 4 the authors formally introduce the
various NoSQL architectures on which most of the solutions are based. One piece of
information that is unique (compared with other books) is the effect of the system
environment, such as RAM and SSD, on the performance and capability of NoSQL
systems. Understanding how hardware variations can affect the choice is interesting,
especially when most of these are cloud-based solutions and, as consumers,
we can choose the configuration we want from the cloud vendor. You can define
various quality-of-service parameters such as maximum read time, maximum write
time, and replication factor. The cost of the solution varies with these
parameters, so in effect we have a list of parameters
that determine the cost of a NoSQL solution. There is a very good radio
analogy to explain how the various parameters drive the cost of the
solution.
However, I am a little disappointed with the content of
chapter 4 and feel that more could have been added.
The authors could have given more examples of
key-value stores, such as maintaining a shopping cart or user profiles. The highly
available systems they mention, such as DNS and directory services, could have
been covered as use cases for key-value stores.
The authors did not clearly explain why a query in
a key-value store is fast, or how the query engine goes
about searching for the information. It is also not
clearly highlighted how a key-value store differs from an RDBMS table with
two columns, where either both are blobs or one is a string and the other a blob.
When not to use a key-value store is not highlighted either.
In general I feel that the book does not have
much hands-on detail. It is good from a theoretical point of view but not from a
practical perspective, such as how to actually write a query in a popular NoSQL
solution. It says that with the help of a graph database the user can perform
relational queries, but how to actually go about doing that is not mentioned.
I feel the introduction of the types of
NoSQL databases could have been more structured and could have revolved around some
common criteria: how well ACID properties are supported, how easy it is to
query (somewhat covered), how well the distribution of data is managed, and
some common examples of existing databases. This would give a better
understanding of why we have different kinds of NoSQL architectures.
They have tried to explain column family
databases, but I think a proper example in which the authors explain how to
create a key would have been very helpful.
They have given three examples of column
family databases and tried to explain how the solution helps users, but
along with this they should have given a small sketch of the entries that would
actually be written in such databases. I could not understand why it would be easier
to retrieve data from a column family database than from other
solutions.
Document-oriented databases get less
description. Again, as a reader I would not want to
refer to multiple books to understand each of these four families. This book
should have given me a clear picture of all four types.
In chapter 6 the authors introduce the
various NoSQL database solutions and explain why organizations are moving towards
NoSQL. They give a list of use cases where NoSQL can be helpful, such as
event logs, remote sensor data, and trends from social media data.
The authors highlight the fact that NoSQL
databases generally provide an easy-to-use interface for performing really complex
tasks. This is one of the reasons why NoSQL projects can be adopted without
a steep learning curve.
Big data problems have various scaling needs
depending on their domain, especially linear scalability, which is one of the biggest
problems with RDBMS solutions and has been an inherent feature of almost
all NoSQL offerings. With so many offerings available in the market, a
solution might be chosen based on its query expressivity and degree of
scalability. For example, key-value stores are the most scalable but the least expressive, as
your queries can only be on the keys. Document stores are the most expressive,
as you can query on all the fields of the stored document.
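To make the expressivity trade-off concrete, here is a minimal sketch of my own (it is not from the book). The two classes are hypothetical in-memory stand-ins, not the API of any real NoSQL product: the key-value store treats values as opaque blobs and can only look them up by key, while the document store can match on any field.

```python
# Hypothetical in-memory stand-ins for illustration only;
# these are not the APIs of any real NoSQL product.

class KeyValueStore:
    """Values are opaque blobs: the only supported lookup is by exact key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class DocumentStore:
    """Documents are dicts, so any field can take part in a query."""

    def __init__(self):
        self._docs = {}

    def put(self, doc_id, doc):
        self._docs[doc_id] = doc

    def find(self, **criteria):
        # Return every document whose fields match all the given criteria.
        return [doc for doc in self._docs.values()
                if all(doc.get(field) == value
                       for field, value in criteria.items())]


kv = KeyValueStore()
kv.put("user:1", b'{"name": "Asha", "city": "Pune"}')
kv.put("user:2", b'{"name": "Ravi", "city": "Delhi"}')
by_key = kv.get("user:1")          # works: we know the key

docs = DocumentStore()
docs.put("user:1", {"name": "Asha", "city": "Pune"})
docs.put("user:2", {"name": "Ravi", "city": "Delhi"})
in_pune = docs.find(city="Pune")   # works: query on a non-key field
```

With the key-value store, answering "who lives in Pune?" would mean fetching and decoding every value in the application, which is the loss of expressivity described above.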
The authors have tried to give a wide spectrum
of use cases and the applicability of various NoSQL solutions. They have
pointed out a different type of NoSQL solution for each particular category of use
case.
I think one of the weakest parts of this
book is the editing. Some of the information given in chapter 6 should have
been in chapter 4, as it is vital for understanding the differences between the
various NoSQL architectures, but you have to wait for later
chapters to get those details. One of the diagrams, which points out a lot of
important aspects of the various types of NoSQL, should have been in chapter 4
rather than in chapter 6.
This book is different from many other books
on the subject, which only talk about the various types of database, compare
them against each other, cover basic concepts like ACID and BASE, and then
end the discussion. This book tries to bring in the usage aspect: how we can
use NoSQL in conjunction with other technologies to add value to the
business, such as how to improve the ETL process, how to make documents
stored in multiple locations searchable, and how to use map-reduce
algorithms with Hadoop to run batch processing on large data sets.
I find the resource-sharing concepts,
such as shared RAM, shared disk, and shared-nothing, quite interesting; they have not
been covered in other books in such detail. The authors have tried to give
the practical implications of each physical architecture for database
performance. They even say that knowing the hardware options available for big
data is an important first step in choosing a NoSQL solution.
Then they discuss the distributed processing
models of master-slave and peer-to-peer architecture. Peer-to-peer is more complex but less prone to failure and hence
provides high availability. Master-slave is less complex but is prone to a single
point of failure.
The authors give a high-level overview
of the map-reduce paradigm. They highlight that one of the main
concerns in a map-reduce program is the uniform distribution of tasks across
all the nodes in the cluster. If the load is not uniform, performance
takes a hit, as too much of the work is done by a single node. They also
point out that one of the selection criteria for a NoSQL solution is how well it
integrates with the Hadoop system.
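The point about uneven load can be illustrated with a small sketch of my own (not from the book). It assumes integer keys and a simple `key % num_reducers` partitioner; real frameworks use their own hash partitioners, and the record counts here are made up.

```python
from collections import Counter

def reducer_load(keys, num_reducers):
    """Count how many records each reducer receives when records are
    routed by key (all records with the same key go to the same reducer)."""
    load = Counter(key % num_reducers for key in keys)
    return [load[i] for i in range(num_reducers)]

# Hypothetical data: 100 records spread over 100 distinct keys...
uniform_keys = list(range(100))
# ...versus 100 records where one "hot" key accounts for 90 of them.
skewed_keys = [0] * 90 + list(range(1, 11))

uniform_load = reducer_load(uniform_keys, 4)
skewed_load = reducer_load(skewed_keys, 4)

# The job finishes only when the busiest reducer finishes.
print("uniform:", uniform_load, "slowest reducer handles", max(uniform_load))
print("skewed: ", skewed_load, "slowest reducer handles", max(skewed_load))
```

Both runs process 100 records on four reducers, but in the skewed case one reducer does almost all of the work while the other three sit nearly idle.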
The authors discuss how the data is handled,
through distributed queries, data distribution over nodes, and replication, so that the time
to answer a query can be reduced when dealing with large amounts of data.
Basically they have tried to convey that data
distribution along with distributed processing is the key to NoSQL's success.
They have introduced Apache Flume, which is
one of the popular ways to collect event logs and move them into
HDFS for analysis with Hadoop.
They have given a case study to show a
practical use of NoSQL solutions in analyzing distributed event logs in
an enterprise. The introduction is good and leaves you
eager to know how it is done, but you don’t get answers about a lot of the actors,
such as the fast channel. For example, how will you filter the critical events from all
the incoming events? What is this fast channel? Is it a notification server or
some kind of RDBMS?
With respect to graph databases, the
authors suggest that they can be used in the healthcare industry to
detect fraud. The authors gave an overview of the problem domain
but did not explain the nuts and bolts of how to create a large shared RAM
that can be extended in the future.
Some questions remain unanswered: Is
there any size limitation on building the shared memory? What is the
reliability of the shared RAM? Can the failure of one individual RAM chip bring
down the entire infrastructure?
In chapter
7 the authors touch upon a very relevant and important topic: how to fetch
data from NoSQL in a timely manner with high recall and precision.
They point out that NoSQL
databases will usually integrate a well-established search technology like Apache Lucene
or Apache Solr to provide reliable full-text search functionality.
The authors explain the key terms used when
building search functionality, such as stemming, indexing, proximity search,
scores, boosting, rank, and the storage strategy for the indexes and the actual data.
The authors point out the use of map-reduce
in creating the inverted (reverse) index. In this chapter too they have tried to give the
practical aspect of using the technology to solve real-life problems.
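As a rough illustration of how map-reduce can build such an index, here is a single-process sketch of my own (not from the book). The three phases are plain Python functions rather than a real Hadoop job, the sample documents are made up, and tokenization is a naive lowercase split with no stemming.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit one (word, doc_id) pair per distinct word in a document."""
    for word in set(text.lower().split()):
        yield word, doc_id

def shuffle(pairs):
    """Shuffle: group all emitted doc_ids by word, as the framework would."""
    grouped = defaultdict(list)
    for word, doc_id in pairs:
        grouped[word].append(doc_id)
    return grouped

def reduce_phase(word, doc_ids):
    """Reduce: produce the sorted posting list for one word."""
    return word, sorted(set(doc_ids))

# Hypothetical sample documents.
documents = {
    1: "NoSQL scales out",
    2: "SQL joins scale poorly",
    3: "NoSQL and SQL coexist",
}

pairs = [pair
         for doc_id, text in documents.items()
         for pair in map_phase(doc_id, text)]
index = dict(reduce_phase(word, ids) for word, ids in shuffle(pairs).items())

print(index["nosql"])  # documents that contain the word "nosql"
```

A search for a word then becomes a single lookup in `index` instead of a scan over every document.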
The authors present several case studies
to explain where search functionality is used and what needs to be done to build
that capability. They give an example of searching
technical documentation, in which they explain the concept of boosting in
detail, and an example of how to create a search engine for finding the correct chart in a
financial enterprise using an XML-based database and Lucene. Finally, they
describe a common problem present in every software organization, where you have
a lot of project documents (SDLC documents) in various formats like docx, pdf, and
jpeg but no easy way to find information in them. They suggest how NoSQL
can be of use in this scenario.
In chapter
8 the authors give a good introduction to high availability,
using an e-commerce website as the example. They point out the business
impact of the database going down while the customer is online. They explain
the jargon associated with high availability, such as failure metrics,
automatic failover, client yield, harvest metric, load
balancing, clusters, and replication.
The authors emphasize the design pattern that advocates
moving the query to the data rather than the data to the query, to save the time
and network bandwidth spent transferring large chunks of data from a storage node to
the processing node. The query can also be distributed so that it uses the processing
power of the various nodes (shared-nothing architecture).
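The pattern can be sketched as follows. This is my own simplified illustration, not code from the book: the "nodes" are just entries in a dictionary, the order data is invented, and in a real cluster `local_query` would run on the machine that owns the rows.

```python
# Hypothetical shards: each node owns its own slice of the order data.
shards = {
    "node-a": [{"total": 10}, {"total": 250}, {"total": 40}],
    "node-b": [{"total": 300}, {"total": 5}],
    "node-c": [{"total": 120}, {"total": 80}],
}

def local_query(rows, min_total):
    """Runs where the data lives and returns only a small summary,
    so the raw rows never cross the network."""
    matching = [row["total"] for row in rows if row["total"] >= min_total]
    return {"count": len(matching), "sum": sum(matching)}

def scatter_gather(shards, min_total):
    """Send the query to every shard, then merge the partial results."""
    partials = [local_query(rows, min_total) for rows in shards.values()]
    return {
        "count": sum(p["count"] for p in partials),
        "sum": sum(p["sum"] for p in partials),
    }

result = scatter_gather(shards, min_total=100)
print(result)
```

Each node ships back two numbers instead of all of its rows, and the nodes can evaluate their part of the query in parallel.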
The authors present case studies of three popular NoSQL
solutions that are known for their high availability features. They point out
that Amazon DynamoDB gives the user the flexibility to choose the read
and write throughput and the type of read consistency, and to scale up or down to
support elastic demand. However, in this case study
I did not get a clear picture of how high availability is maintained in
DynamoDB. What I understood was how flexible the offering is.
The authors then present case studies on Cassandra
and Couchbase and how they meet high availability expectations. The
discussion is brief and is approached from a reliability and high availability
perspective.
I did not like chapter 9, on agility, because the
only takeaway for me was that NoSQL databases are schemaless, so it is easy to
change the fields, which I think was already covered in the introductory chapter.
The case study did not help me understand the agility part, and I doubt whether
the arguments are correct. I think this chapter could be dropped or its main
points included in some earlier chapter.
Chapter 10 is a heavyweight
chapter and should be read with a relaxed mind. However, I
don’t understand the utility of this chapter in a NoSQL book. The
chapter covers the fundamentals of functional programming and hence of
map-reduce. It provides a very high-level overview of functional
programming, but since the topic is complex, a high-level description does not
serve much purpose. I think the section where the authors compare
imperative and functional programming with a diagram is
all they needed to provide, followed by map-reduce. That material could
therefore have been moved to chapter 6, where they introduce map-reduce.
The “Apply your knowledge” section
does not connect the dots. How is the content related to NoSQL, and why could it
not have been in a functional programming book? Also, the whole argument about
caching and its relation to functional programming does not make sense to me. But again,
this is my view.
The
only good part is the introduction to Erlang, as it gives an idea of why it
is becoming so popular.
Chapter 11 has nice details on the implicit
requirement for security features in various NoSQL databases. The content is
nicely written and an easy read. The authors first help us understand
the fundamentals of building security into a database solution,
which boil down to four key aspects: authentication,
authorization, audit, and encryption. You want to make sure that only
the right people have access to the appropriate data in your database. You also
want to track their access and transmit data securely in and out of the
database.
The chapter also points out that RDBMSs are better off than
NoSQL, since they have quite mature security systems, whereas security in NoSQL is
still maturing. The authors discuss how security can be shared between the
application and the database level. They then try to give a map of the
various techniques and their relative advantages over each other.
The authors take a few case studies to show how popular
services implement security and at what granularity. Amazon S3 provides
bucket-level security, while in Apache Accumulo authorization is applied to
each key-value pair, so a user can be denied access to particular keys.
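The idea of per-entry authorization can be sketched like this. It is my own simplified model and not the Accumulo API: each cell here carries a plain set of labels that must all be held by the reader, whereas real Accumulo visibility labels can be richer boolean expressions. The patient data and label names are invented.

```python
# Each cell: (row, column, value, labels required to read it).
# Simplified model for illustration; not the actual Accumulo API.
cells = [
    ("patient:17", "name", "Asha", set()),
    ("patient:17", "diagnosis", "asthma", {"doctor"}),
    ("patient:17", "billing", "card-on-file", {"finance", "audit"}),
]

def visible_cells(cells, authorizations):
    """Return only the cells whose required labels are all held by the user."""
    return [(row, column, value)
            for row, column, value, required in cells
            if required <= authorizations]

doctor_view = visible_cells(cells, {"doctor"})
finance_view = visible_cells(cells, {"finance"})
auditor_view = visible_cells(cells, {"finance", "audit"})
```

Two users scanning the same row get different results, which is a much finer granularity than granting or denying access to a whole bucket.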
In the last chapter, the authors point out that deciding on
the right database solution for a project is not a simple exercise. Unfamiliarity
with the NoSQL paradigm might be the greatest hurdle to the project team
accepting it as a database option. The team might be biased towards a particular
RDBMS technology that they have been using for decades.
The authors present an architectural trade-off analysis to objectively select the database
that is the best fit for a business problem. The idea is basically to list
and prioritize the business requirements and then to score the effort required
to implement them with a particular NoSQL solution. There is a long list
of dos and don’ts for creating an architecture team, followed by guidelines
on how to go about the trade-off analysis.
The authors
suggest that, apart from comparing database solutions on technical
capabilities like indexes, query support, and web access, they should also be
compared on quality parameters like scalability, availability, portability,
searchability, and agility.
The authors suggest preferring solutions that can be
deployed on the cloud, as they reduce the infrastructure cost while
also taking care of scalability.
The authors give a nice diagram (the quality tree) that shows
how the different quality parameters can be broken down into specific
features (for example, searchability translates to full-text search, XML search, and custom
scoring) and how each of the features is prioritized for the organization. The
quality tree sometimes helps to bring the stakeholders onto the same page, especially
when there is a mixed audience that might not be conversant
with the technical jargon.
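One simple way to turn such priorities into a comparison is a weighted score. The sketch below is my own and not the book's method; every weight and fit score in it is an invented number used purely for illustration.

```python
# Invented numbers for illustration only; not taken from the book.
# Priority of each quality attribute for the organization (higher = more important).
priorities = {"scalability": 3, "availability": 3, "searchability": 2, "agility": 1}

# How well each candidate fits each attribute, on a 1-5 scale.
fit = {
    "key-value store": {"scalability": 5, "availability": 4,
                        "searchability": 1, "agility": 3},
    "document store": {"scalability": 4, "availability": 3,
                       "searchability": 5, "agility": 4},
}

def weighted_score(scores, weights):
    """Sum of (priority x fit) over all quality attributes."""
    return sum(weights[quality] * scores[quality] for quality in weights)

totals = {name: weighted_score(scores, priorities)
          for name, scores in fit.items()}
best = max(totals, key=totals.get)
print(totals, "->", best)
```

The value of the exercise is less in the final number than in forcing the stakeholders to agree on the priorities before the products are compared.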
I
think this chapter is meant for project managers or architects who are
trying to get the nod from management and other stakeholders. Techniques
like the quality tree and architectural trade-off analysis will help in preparing
a strong case and getting the point across the table.
To summarize:
One problem I have seen consistently with this
book is that the introduction to each chapter lists topics
that are meant to be covered in that chapter, but the coverage is
uneven. A few of the topics are not covered or are covered only superficially. I
think that might get resolved in future editions of this book. Even the titles
of the case studies and their content do not always match. Sometimes the
case studies are more about a product than about the concept being discussed in
the chapter.
The authors generally
introduce a topic in a very simple manner, giving examples and pictures.
Case studies can be a very good way to show the
applicability of a topic in a real-life system, so that readers know
how to apply the concepts. Unfortunately, even though each chapter has a lot of
case studies, almost all of them have missing information
or digressions that do not help the authors summarize the
topic effectively. They are good to read independently, but when you try to see the applicability
of the topic under discussion, I find some gaps. Hopefully these gaps will be filled
in upcoming editions.
Even though there are problems with the content,
there are advantages too. The biggest advantage of this book is that it
covers topics like full-text search, high availability, and map-reduce with NoSQL
databases, which other books generally don’t cover or cover only in
their further reading sections. So it is a good book for an introduction to
the complexity and opportunities of NoSQL databases. The case studies
show how complex scenarios that were very difficult to implement a few years back
can be resolved using NoSQL solutions.
My
recommendation is that this book should be read to get a wide perspective on
the subject, not to become a master of it.