Friday, January 24, 2014

Review of Manning Publications' latest book "Making Sense of NoSQL"


I recently finished reading this book and thought of writing my opinion about it. I have mixed feelings: from my viewpoint the book has several good points, while there are a few areas where I expected more and was a little disappointed. I will point out what I did not like as I give my opinion about the chapters.

Like most NoSQL books, the authors start by giving an overview of early NoSQL solutions and the reasons why they came into existence. They give five case studies to explain how some companies came up with proprietary solutions to solve their big data problems; these solutions laid the foundation for the current NoSQL products.

A good part of this book is that the authors first introduce a concept and then present a case study around it to show the readers how the concept is used in the real world. This is very important, since when a technology is new it is very difficult to envisage how to use it in our projects.

The authors very clearly explain the differences between RDBMS and NoSQL. They are not biased towards NoSQL and give a practical opinion of the two approaches. They have tried to build the fundamentals from the ground up by giving the usage patterns, the terminology, examples, and the key features of NoSQL solutions.
Almost every book tries to explain ACID, BASE, and CAP without giving any practical usage, but in this book the authors devote a lot of pages to explaining the fundamentals and to showing how they make RDBMS different from NoSQL. The authors explain the CAP theorem (applicable only in the case of a network partition) in a very effective manner, and I would say this is the best description of the theorem compared to any book on a similar topic.
A few chapters have an "Apply Your Knowledge" section at the end, which presents a real-world situation and works out a solution based on the concepts built in the chapter. This is a very good way of showing the utility of the concepts to the audience, but I think this section could have been much better. Usually the arguments given for choosing among the options are not good, or the authors directly state which solution is correct without discussing how to select that approach. In short, the audience will understand what the solution for the given problem should be, but the approach to reach it might not be that clear.

The authors highlight that SQL joins are the reason why an RDBMS is not efficient in a cluster environment: the tables can be spread over multiple systems, and computing the join then becomes highly complex. They also suggest that transactions are the main reason why RDBMS gained so much popularity, especially in e-commerce and banking, and they dedicate a large chunk of chapter 3 to transactions and their importance. NoSQL does not have transactions built in, and this is one of the reasons why enterprises are a little skeptical about accepting it.
The authors suggest in many places that it is practical to have a combination of RDBMS and NoSQL. One such use case highlighted is that of OLAP.

In chapter 4 the authors formally introduce the various NoSQL architectures on which most of the solutions are based. One piece of information unique to this book (different from other books) is the effect of the system environment, like RAM and SSDs, on the performance and capability of NoSQL systems. Understanding how hardware variations can impact the choice is interesting, especially since most of these are cloud-based solutions and, as consumers, we can choose what configuration we want from the cloud vendor. You can define various quality-of-service parameters like maximum read time, maximum write time, replication factor, etc., and depending on these parameters the cost of the solution will vary. So basically we have a list of parameters that determine the cost of a NoSQL solution. There is a very good analogy with a radio to explain how the various parameters lead to the cost of the solution.
However, I am a little disappointed with the content of chapter 4 and feel that more could have been added.
The authors could have given some more examples of key-value stores, like maintaining a shopping cart or user profiles. The highly available systems they point out, like DNS and directory services, could also have been covered as use cases for key-value stores.

The authors do not clearly explain why a query in a key-value store will be fast, or the ways in which the query engine goes about searching for the information. It is also not clearly highlighted how a key-value store differs from an RDBMS table with two columns (either both blobs, or a string and a blob). When not to use a key-value store is not highlighted either.

In general I feel that the book does not have many hands-on details. It is good from a theoretical point of view but not from a practical perspective, like how to actually write a query in a popular NoSQL solution. It says that with the help of a graph database the user can perform relational queries, but how to actually go about doing that is not mentioned.

I feel the introduction of the types of NoSQL databases could be more structured and could have revolved around some common features: how well ACID properties are supported, how easy it is to query (somewhat covered), how well the distribution of data is managed, and some common examples of existing databases. This would give a better understanding of why we have different kinds of NoSQL architectures.
They have tried to explain column family databases, but I think a proper example where the authors explain how to design a key would have been very helpful.
They give three examples of column family databases and try to explain how the solution helps the users, but along with this they should have given a small sketch of the entries that would actually be written in such databases. I could not understand how it would be easier to retrieve data from a column family database compared to other solutions.

Document-oriented databases get comparatively little description. Again, as a reader I would not want to refer to multiple books to understand each of these four families; this book should have given me a clear picture of all four types.

In chapter 6 the authors introduce the various NoSQL database solutions and explain why organizations are moving towards them. They give a list of use cases where NoSQL can be helpful, like event logs, remote sensor data, and trends from social media data.

The authors highlight the fact that NoSQL databases generally provide an easy-to-use interface for performing really complex tasks. This is one of the reasons why NoSQL projects are being adopted without a steep learning curve.
Big data problems have various scaling needs depending on the domain, especially linear scalability, which is one of the biggest problems with RDBMS solutions and an inherent feature of almost all NoSQL offerings. With so many offerings available in the market, a solution might be chosen based on query expressivity and degree of scalability. For example, key-value stores are the most scalable but the least expressive, since queries can only be on the keys; document stores are the most expressive, since you can query on all the fields of a stored document.
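To make the expressivity contrast concrete, here is a small sketch in Python (the client libraries, data, and names are my own illustration, not the book's): a key-value store can only fetch by key, while a document store can match on any field.

```python
import redis                      # typical key-value store client
from pymongo import MongoClient   # typical document store client

# Key-value store: the key is the only handle on the data.
kv = redis.Redis()                # assumes a local Redis instance
kv.set("user:42", '{"name": "Asha", "city": "Delhi"}')
print(kv.get("user:42"))          # fine: lookup by key
# There is no way to ask the store "which users live in Delhi?"

# Document store: every field of the stored document is queryable.
docs = MongoClient().shop.users   # hypothetical database/collection
docs.insert_one({"name": "Asha", "city": "Delhi"})
print(docs.find_one({"city": "Delhi"}))   # query on an arbitrary field
```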

The authors try to give a wide spectrum of use cases and the applicability of various NoSQL solutions, pointing out a suitable type of NoSQL solution for each category of use case.
I think one of the weakest parts of this book is the editing. Some of the information given in chapter 6 should have been in chapter 4, as it is vital to understanding the differences between the various NoSQL architectures, but you have to wait for later chapters to get those details. One of the diagrams that points out a lot of important aspects of the various types of NoSQL should have been in chapter 4 rather than in chapter 6.

This book is different from many other books on the subject, which only talk about the various types of databases, compare them against each other, cover basic concepts like ACID and BASE, and then end the discussion. This book tries to bring in the usage aspect: how we can use NoSQL in conjunction with other technologies to add value to the business, for example how we can improve the ETL process, how we can make documents stored in multiple locations searchable, and how to use map-reduce algorithms with Hadoop to do batch processing on large data.

I find the discussion of resource-sharing concepts like shared RAM, shared disk, and shared-nothing quite interesting; they have not been covered in other books in such detail. The authors give the practical implications of each physical architecture on database performance. They even say that knowing the hardware options available for big data is an important first step in choosing a NoSQL solution.

Then they discuss the distributed processing models of master-slave and peer-to-peer architectures. Peer-to-peer is more complex but less prone to failure, and hence provides high availability; master-slave is less complex but is prone to a single point of failure.

The authors give a high-level overview of the map-reduce paradigm. They highlight that one of the main concerns of a map-reduce program is the uniform distribution of tasks across all the nodes in the cluster: if the load is not uniform, performance takes a hit, as too much of the work is done by a single node. They also point out that one of the selection criteria for a NoSQL solution is how well it integrates with the Hadoop ecosystem.

The authors discuss ways in which the data is handled, like distributed queries, data distribution over nodes, and replication, so that the time to answer a query can be reduced when dealing with large amounts of data.
Basically, they try to convey that data distribution along with distributed processing is the key to NoSQL's success.
They introduce Apache Flume, which is one of the popular ways to collect event logs into Hadoop and HDFS for analysis.

They give a case study showing a practical use of NoSQL solutions in analyzing distributed event logs in an enterprise. But I find that though the introduction is good and you are eager to know how it is done, you don't get answers about a lot of the actors, such as the fast channel. For example, how will you filter the critical events from all the incoming events? What is this fast channel? Is it a notification server or some kind of RDBMS?

With respect to graph databases, the authors suggest that they can be used in the healthcare industry to detect fraud. The authors give an overview of the problem domain but do not explain the nuts and bolts of how to create the large shared-RAM system that can be extended in the future.
Some questions remain unanswered: Is there any size limitation on building the shared memory? What is the reliability of the shared RAM? Can the failure of one individual RAM chip bring down the entire infrastructure?

In chapter 7 the authors touch upon a very relevant and important topic: how to fetch data from a NoSQL database in a timely manner with high recall and precision.

They point out that NoSQL databases will usually integrate a well-established search library like Apache Lucene or Apache Solr to provide reliable full-text search functionality.

The authors explain the key terms used while building search functionality, like stemming, indexing, proximity search, scores, boosting, rank, and the storage strategy for the indexes and the actual data.

The authors point out the use of map-reduce in creating the reverse (inverted) index. In this chapter too, they try to give the practical aspect of using the technology to solve real-life problems.
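For readers new to the idea, a reverse (inverted) index simply maps each term to the documents containing it. Here is a minimal single-machine sketch in Python of what the map-reduce job conceptually computes (the document contents are invented for illustration):

```python
from collections import defaultdict

docs = {
    "doc1": "nosql databases scale horizontally",
    "doc2": "relational databases use sql",
}

# "Map" step: emit (term, doc_id) pairs; "reduce" step: group pairs by term.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

print(sorted(index["databases"]))   # -> ['doc1', 'doc2']
```

In a real map-reduce job the map and reduce steps would run in parallel over many nodes, but the output has the same shape.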

The authors present several case studies to explain where search functionality finds its use and what needs to be done to build that capability. They give an example of searching technical documentation, where they explain the concept of boosting in detail, and of building a search engine for finding the correct chart in a financial enterprise using an XML-based database and Lucene. Finally, they describe a common problem present in every software organization: you have a lot of project documents (SDLC documents) in various formats like docx, pdf, and jpeg, but no easy way to find information in them. They suggest how NoSQL can be of use in this scenario.

In chapter 8 the authors give a good introduction to the subject of high availability, taking an e-commerce website as an example. They point out the business impact of the database going down while the customer is online, and they explain the various terms associated with high availability, like failure metrics, automatic failover, client yield, harvest metric, load balancing, clusters, and replication.
The authors emphasize the design pattern that advocates moving the query to the data, and not the data to the query, to save the time and network bandwidth involved in transferring large chunks of data from one node to the processing node. The query can also be distributed so as to use the processing power of the various nodes (shared-nothing architecture).

The authors present case studies of three popular NoSQL solutions known for their high availability features. They point out that Amazon DynamoDB gives the user the flexibility to choose read and write throughputs and the type of read consistency, and to scale up or down to support elastic demand. However, in this case study I did not get a clear picture of how high availability is maintained in DynamoDB; what I understood was how flexible the options are.

The authors then present case studies on Cassandra and Couchbase and how they meet high availability expectations. The discussion is brief and is approached from a reliability and high availability perspective.

I did not like chapter 9, which is on agility, because the only takeaway for me was that NoSQL databases are schema-less, so it is easy to change fields; I think this was already covered in the introductory chapter. The case study did not help me understand the agility aspect, and I doubt whether the arguments are correct. I think this chapter could be dropped, or its main points included in earlier chapters.

Chapter 10 is a heavyweight chapter and should be read with a relaxed mind. I don't understand the utility of this chapter in a NoSQL book. It covers the fundamentals of functional programming, and hence of map-reduce. The chapter provides a very high-level overview of functional programming, but since the topic is complex, a high-level description does not serve much purpose. I think the section comparing imperative and functional programming with a diagram is all they should have provided, followed by map-reduce; that material could then have moved to chapter 6, where they introduce map-reduce.

The "Apply your knowledge" section does not connect the dots. How is the content related to NoSQL, and why could it not have been in a functional programming book? Also, the whole argument about caching and its relation to functional programming does not make sense to me. But again, this is my view.

The only good part is the introduction to Erlang, as it gives an idea of why it is becoming so popular.

Chapter 11 has nice details on the implicit requirement for security features in various NoSQL databases. It is nicely written and an easy read. The authors first help us understand the fundamentals of building security into a database solution, which boil down to four key aspects: authentication, authorization, audit, and encryption. You want to make sure that only the right people have access to the appropriate data in your database; you also want to track their access and transmit data securely in and out of the database.
The chapter also points out that RDBMS are better off than NoSQL here, since they have quite mature security systems; in the case of NoSQL, security is still maturing. The authors discuss how security can be shared between the application and the database level, and then map out the various techniques and their relative advantages over each other.

The authors take a few case studies to show how popular services implement security, and at what granularity: Amazon S3 provides bucket-level security, while in Apache Accumulo authorization is applied to each key-value pair, so a user can be denied access to particular keys.
In the last chapter, the authors point out that it is not a simple exercise to decide upon the right database solution for a project. Unfamiliarity with the NoSQL paradigm might be the greatest hurdle for a project team to accept it as a database option; they might be biased towards a particular RDBMS technology they have been using for decades.

The authors present an architectural trade-off analysis to objectively select the database that best fits a business problem. The idea is to list and prioritize the business requirements and then score the effort required for a particular NoSQL solution to implement them. There is a long list of dos and don'ts for creating an architectural team, followed by guidelines on how to carry out the trade-off analysis.

The authors suggest that apart from comparing database solutions on technical capabilities like indexes, query support, and web access, they should also be compared on quality parameters like scalability, availability, portability, searchability, and agility.

The authors suggest preferring solutions that can be deployed on the cloud, as they reduce infrastructure cost while taking care of scalability at the same time.

The authors give a nice diagram (a quality tree) that shows how the different quality parameters can be quantified in terms of specific features (searchability, for example, translates to full-text search, XML search, and custom scoring) and how each feature is prioritized for the organization. The quality tree can help bring stakeholders onto the same page, especially with a mixed audience that may not be conversant with the technical jargon.

I think this chapter is meant for project managers or architects who are trying to get the nod of management and the other stakeholders. Techniques like the quality tree and architectural trade-off analysis will help in preparing a strong case and driving the point across the table.

To summarize:

One problem I have seen consistently with this book is that the introduction of each chapter lists topics intended to be covered in that chapter, but the coverage is uneven: a few of the topics are not covered, or are covered only insignificantly. That might get resolved in future editions. Even the titles of the case studies and their content do not always match; sometimes a case study is more about a product than about the concept being discussed in the chapter.

The authors generally introduce each topic in a very simple manner, giving examples and pictures. Case studies can be a very good way to show the applicability of a topic in a real-life system, so that readers know how to apply the concepts. Unfortunately, even though each chapter has a lot of case studies, almost all of them have missing information or digressions that do not help the authors effectively summarize the topic. They are good to read independently, but when you try to see the applicability of the topic under discussion, I find gaps. Hopefully these gaps will be filled in upcoming editions.

Even though there are problems with the content, there are advantages too. The biggest advantage of this book is that it covers topics like full-text search, high availability, and map-reduce with NoSQL databases, which other books generally don't cover, or cover only in their further-reading sections. So it is a good book for an introduction to the complexity and opportunities offered by NoSQL databases. The case studies point out how complex scenarios that were very difficult to implement a few years back can be resolved using NoSQL solutions.


My recommendation is to read this book to get a wide perspective on the subject, not to become a master of it.

Monday, December 2, 2013

A software engineer's perspective on IOT and its related technologies

Introduction

IOT (Internet of Things), in simple terms, means connecting together everything that has an IP address. Every device these days has an IP address, thanks to IPv6, and so can be connected. Devices have become intelligent and can produce data varying from a few bits (temperature and pressure sensors) to multimedia files (traffic cameras sending live feeds and photos).

The basic idea of IOT is that devices can keep transferring data to a backend server where other devices can access it, or they can communicate with each other to make decisions.

Today it is possible for a sensor to send a tweet every second with its reading. The latest Mars rover "Curiosity" can communicate with Twitter and update "What's happening at Mars" (https://twitter.com/MarsCuriosity).

Voice recorders can transfer voice data to a central server; this is generally used in the BPO industry to record the conversations between a BPO executive and a customer.

Mobiles, wireless sensors, RFID tags, surveillance cameras, and routers are some of the common devices generating data and sending it to servers. In this article I use "sensors" and "devices" interchangeably.

Since devices are interconnected globally, there can potentially be billions of devices connected together, and with the kind of data being generated, a massive amount of data needs to be collected every day.

Why is it really important to gather all the data from the sensors? Analytics on the data can provide huge optimization opportunities for industrial processes, which can result in massive dollar savings.

GE, the industrial giant, claims that performing analytics on the data collected from their airplane engines can result in a 1% fuel efficiency gain, which has the potential to save about $1Bn per year. They call it the Power of 1. The savings come from the fact that with all the data analytics they can provide proactive care to the engines rather than reactive care. Servicing can be on a need basis rather than on a fixed schedule, which keeps the parts moving as required. This also means no unplanned downtime: the service happens when a part is about to wear down, not when it has stopped working. The data from the engine creates a service request if the analytics find that a part requires servicing. (http://www.ge.com/docs/chapters/Industrial_Internet.pdf)

This is mind-boggling. You might ask whether this is correct or just a marketing gimmick. Fuel savings actually have a massive effect on the cost of operations: a very rough estimate suggests that if the people of a small suburb of Bangalore used public transport to get to work instead of their personal vehicles, it could save up to $50,000 worth of fuel per day. (http://rohitagarwal24.blogspot.in/2013/10/fun-with-number-save-fuel-save-dollars.html)

Some popular use cases for appreciating the power of IOT:

  1. Insurance premiums can be customized to the actual risk of operating a vehicle, rather than being based on proxies such as the driver's age, gender, or place of residence.
  2. In retail, sensors embedded in a membership card can note the shopper's profile based on the shopping list, and this data can be sent to the server. The next time the same shopper comes back, he or she can be given offers at the point of sale.
  3. Airplane and rail engines can send continuous data about their wear and tear to a central computer, allowing for proactive maintenance rather than scheduled maintenance and reducing unplanned downtime. It is estimated that 22 billion dollars are wasted annually by commercial airlines due to flight delays and unexpected fuel consumption, which further strengthens the notion that analytics can play a big role in reducing this wastage.
  4. A sensor network based on video feeds, audio, and vibration detectors can spot unauthorized individuals, increasing the ability of security personnel to detect trespassers.


In all these use cases, the one thing to note is the analytics: based on the data sent out by the devices, the application needs to build analytics and then take actions or decisions.

I was amazed to learn that wind turbines these days have the capability to connect to another wind turbine to check the wind speed at that point and balance the wind load among themselves, thereby optimizing the use of wind for energy generation. This is also called "machine to machine communication" (http://en.wikipedia.org/wiki/Machine_to_machine).

Issues with IOT

Before getting hooked on a new technology it is very important to know the problems that come along with it.
There are three important bottlenecks while working with sensors, or so-called things: storage, battery (power), and connectivity.

Storage Problem

It is not possible for a device to store all its sensor data locally, as devices don't have that kind of storage. For example, a traffic camera can generate thousands of megapixel images; it is not possible to store everything in the camera. Also, if the camera is damaged or stolen, the data is lost. So the common approach is to transfer the data to a backend database, where it gets stored.
But storage on a backend server is not as simple as it looks.

It is clear that the data is heterogeneous in nature (images, video, sensor readings). The data is also mostly unstructured, like videos and log files.
We cannot use a relational database to store it, as relational databases don't handle unstructured data very efficiently.

All you can do is save it as a blob, which means you cannot query the data inside the blob, and hence cannot retrieve it for analysis. There is also a limit to the amount of data that can be placed in an RDBMS. Imagine you have collected video streams from traffic cameras across a city like Delhi. We can safely assume there will be at least 100 traffic cameras, and that each camera sends one image per second. If each image is, say, 500KB, then you have 100 * 500KB = 50MB of data per second. Multiply by 24 * 3600 seconds, and that makes 50MB * 3600 * 24 = 4320GB, or roughly 4.3TB, of data per day! That is far too much data to store on a single computer; the storage disk will soon run out. Querying such massive data would also take forever with a regular SQL solution. The obvious choice is data federation over a cluster, but regular RDBMS solutions do not scale well on clusters.
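As a back-of-the-envelope check, here is the same arithmetic as a small Python sketch (the camera count and image size are just the illustrative numbers assumed above):

```python
# Rough estimate of daily data volume from city traffic cameras.
cameras = 100              # assumed number of cameras
image_kb = 500             # assumed size of one image in KB
images_per_second = 1      # each camera sends one image per second

per_second_mb = cameras * images_per_second * image_kb / 1000.0   # 50 MB/s
per_day_gb = per_second_mb * 3600 * 24 / 1000.0                   # 4320 GB/day

print(f"{per_second_mb:.0f} MB/s -> {per_day_gb:.0f} GB/day (~{per_day_gb / 1000:.1f} TB)")
```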

To cater to the problems of heterogeneity and massive data, NoSQL databases come to the rescue. We will discuss this in a later section.

Battery Issue

One important consideration is the energy, or battery, consumption of the devices. Most remote sensors run on battery power (they may not be connected to a constant power supply). If they are constantly transferring data, their battery may soon drain and you will stop getting data from the sensors, so energy management has to be in place.

One approach is to send bursts of data to the central server at specific intervals instead of sending data constantly. This saves the battery cost involved in remaining connected.
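Here is a minimal sketch of the burst idea in Python (read_sensor() and send_batch() are hypothetical stand-ins for the device's sensor and radio APIs): readings are buffered locally, and the radio is used only once per interval.

```python
import time

FLUSH_INTERVAL = 60   # seconds between radio bursts (assumed)

def read_sensor():
    """Hypothetical sensor read; returns one sample."""
    return 10.5

def send_batch(readings):
    """Hypothetical uplink call; in practice, the device's radio/network API."""
    print(f"transmitting {len(readings)} readings in one burst")

buffer, last_flush = [], time.time()
while True:
    buffer.append(read_sensor())          # sampling stays cheap and local
    if time.time() - last_flush >= FLUSH_INTERVAL:
        send_batch(buffer)                # radio powered up only for the burst
        buffer, last_flush = [], time.time()
    time.sleep(1)                         # sample once per second
```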

But the burst approach is not always feasible. In scenarios like traffic cameras, where real-time data has to be transmitted from the camera to the control room, the sensors are always online and in transmission mode.

Or intelligent machines like wind turbines, which control themselves based on wind speed and rain, transmit data to a central computer and get the analytics back.

Or the sensors in a plant are in constant transmission mode. These are all active sensors and need to be always connected; passive sensors, on the other hand, can defer the connection to a need basis. In such scenarios a constant source of power is inevitable.

Connectivity

Another critical problem is unreliable or limited internet connectivity. There are scenarios where sensors are deployed in remote locations with very limited or no internet connectivity. One approach to getting data from these sensors is to use a temporary connection once a day or week, depending on the criticality of the data and the amount of storage at the sensor; this connects the sensor to the network so the data can be transferred. Among the techniques being tried is Google's Project Loon, which uses balloons to provide connectivity (http://en.wikipedia.org/wiki/Project_Loon).
There have also been attempts where a vehicle was used to connect the nodes (a data mule); see "Data MULEs: Modeling a Three-tier Architecture for Sparse Sensor Networks".

This approach exploits the presence of mobile entities (called MULEs) in the environment. MULEs pick up data from the sensors when in close range, buffer it, and drop it off at wired access points. This addresses the connectivity issue, and at the same time it can lead to substantial power savings at the sensors, since they only have to transmit over a short range (just to the vehicle).


This data mule model is more suitable for passive sensors; for active sensors like traffic cameras, connectivity has to be available all the time.

Understanding a technical framework for employing an IOT based solution

An IOT based solution is one where you are able to analyze the data coming from the machines (devices) to predict something useful for the existing workflow.
Based on my understanding of IOT and its usage, there is a technical infrastructure that needs to be put together. I have tried to draw a rough diagram of what can be called an IOT stack.



Devices/Sensors:

These are the physical sensors/devices that form the core of IOT. It is assumed that the sensor or "thing" is able to gather some data, and that the data can be sent to a central backend server for detailed analysis. Some common IOT devices are temperature/pressure sensors, routers, and cameras.

Communication Protocol:

Since the devices are constrained by battery power, bandwidth availability, and memory, the regular TCP/IP stack is too bulky for practical use. The main problem with the standard communication protocols is the overhead of sending large headers in each packet. They were designed for PCs and workstations, where memory and bandwidth are not major issues. In a TCP packet over IPv6, a large number of bytes go into header information and the real data is very little. This is not an optimized solution, as the sensors will burn their battery sending heavy headers rather than actual data. For this reason new protocol standards with less overhead have been proposed, such as 6LoWPAN (http://en.wikipedia.org/wiki/6LoWPAN).
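To see why the headers dominate, here is a rough illustration using the standard minimum header sizes (a 40-byte fixed IPv6 header and a 20-byte minimum TCP header) with an assumed 4-byte sensor reading as the payload:

```python
IPV6_HEADER = 40   # bytes, fixed IPv6 header size
TCP_HEADER = 20    # bytes, minimum TCP header size (no options)
PAYLOAD = 4        # bytes, assumed size of one sensor reading

total = IPV6_HEADER + TCP_HEADER + PAYLOAD
overhead = (IPV6_HEADER + TCP_HEADER) / total
print(f"{total} bytes on the wire, {overhead:.0%} of it is header")
# -> 64 bytes on the wire, 94% of it is header
```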

BigData based storage:

We discussed earlier that sensor/IOT data is huge and heterogeneous in nature. In this section I will try to elaborate on this aspect.
For example, we can have sensors producing data in the following formats:
Sensor1Reading => 10.5,3 (temperature, pressure)
Sensor2Reading => img
Sensor3Reading => log

In the case of an RDBMS it is all about schemas and tables, and no single schema can satisfy these three different kinds of data.
One way is to store each reading as a blob,
so we can have the schema as:
Sensor1, blob1
Sensor2, blob2
Sensor3, blob3

But the problem is that we cannot write a query saying "give me the average temperature recorded by sensor1", since blobs are treated as opaque binary data and a SQL query will not look into the content of a blob. One of the key reasons for using a database is being able to query the data efficiently, which is obviously not satisfied in this case.

In this situation big data, or NoSQL-based, solutions come to the rescue.
In very simple terms, NoSQL means a schema-less database, where storage is mostly in the form of key-value pairs. The keys and values can be anything, so the above data could be stored as:
Sensor1Reading => 10.5,3 (temperature, pressure)
Sensor2Reading => img
Sensor3Reading => log
A big data store lets the user query into the values of the keys, which resolves the querying limitation of blobs.
Another important advantage of big data solutions is the ease of changing the data set or the stored schema. For example, if in future you want to add a location parameter to the sensor1 value, it can be done easily: just start adding the geolocation to the value while storing it, so the entries will look like
Sensor1Reading => 10.5,3,loc1 (temperature, pressure, location)
Sensor2Reading => img
Sensor3Reading => log

Imagine the same thing in an RDBMS: you would have to add a new column to the table and then check that no foreign key constraints are violated. This problem explodes if you have to keep adding new fields.
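As an illustration, here is a minimal sketch using MongoDB through the pymongo driver (the database, collection, and field names are assumptions for this example). Note how documents of different shapes coexist in one collection, how a new field is added without any schema migration, and how the values remain queryable, unlike a blob:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed local instance
readings = client.iot.readings                      # hypothetical db/collection

# Heterogeneous documents live side by side; no table schema to satisfy.
readings.insert_one({"sensor": "sensor1", "temperature": 10.5, "pressure": 3})
readings.insert_one({"sensor": "sensor2", "image": b"...image bytes..."})
readings.insert_one({"sensor": "sensor3", "log": "2013-11-30 12:00:01 INFO started"})

# Adding a location later needs no ALTER TABLE; just store the extra field.
readings.insert_one({"sensor": "sensor1", "temperature": 11.2, "pressure": 3,
                     "location": "loc1"})

# Average temperature recorded by sensor1 -- the query a blob could not answer.
avg = readings.aggregate([
    {"$match": {"sensor": "sensor1"}},
    {"$group": {"_id": None, "avgTemp": {"$avg": "$temperature"}}},
])
print(list(avg))
```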

Query time is another major reason for not choosing an RDBMS here. On a single machine, a search over a huge unindexed table is essentially linear: if it is a million-record table and your record of interest is at the millionth row, the scan takes a million times the per-record seek time, and disk I/O becomes the bottleneck.
The main problem is that RDBMS solutions do not efficiently support parallel query execution, and generally do not work on cluster-based deployments. Big data solutions, on the other hand, fundamentally support cluster-based data federation: a query is executed in parallel on multiple nodes, and the search time is drastically reduced.

Basically you need to answer the following questions before choosing the right database solution:
  1. How do you define the data (schema)? Can it support unstructured data?
  2. What is the limit of data storage? How do you scale the data? Can you add more disks and distribute the data?
  3. How easy is it to change the schema of the data, given that you might add a new sensor producing a different kind of data?
  4. How do you query the data with accuracy and efficiency (the query model)? Without the ability to query it, the data is no more than file folders. Also, what is the time involved in querying? Some interesting queries (a concrete sketch of one follows the list below) can be:

· Query all the plate numbers of cars that were speeding above 80 km/h between 20-09-2013 and 24-09-2013 in the Connaught Place area of New Delhi.
  o Here we are interested in the traffic surveillance records coming from cameras and speed sensors. Image processing will be involved, since we are interested in the number plates.
· Query the flight path of flight number 123 on 20-10-2008.
  o Since the data is 6 years old, the search can take some time. This is the kind of challenge a big data solution has to resolve before it can even be considered a potential solution.
· Query the locations of the nearest 10 mobile phones from the GPS position (22, 77).
  o This needs a lot of distance calculations between thousands of mobile phones in that region, and yet must return the result in a few seconds.
· Find my Facebook friends.
  o The query looks simple, but yours could be one of 200 million names in the database. An RDBMS query works sequentially: at any speed it would take minutes to hours just to find your name in the database, and then performing the join would bring down the Facebook servers.
· Find out whether a person has already completed the scans for an Aadhaar card.
  o This has to match iris scans and fingerprint scans. Even if 1 crore (10 million) people have been registered, imagine the computing power required. And surprisingly, the matches are verified in less than a minute.
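As a flavor of what the first of these might look like in a document store, here is a hedged sketch in MongoDB syntax (the collection and field names are assumptions for illustration; a real system would also need an image-processing pipeline to extract the plate numbers in the first place):

```python
from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
records = client.traffic.speed_records   # hypothetical collection of camera records

# Plate numbers of cars above 80 km/h in Connaught Place between the two dates.
plates = records.distinct("plate_number", {
    "area": "Connaught Place",
    "speed_kmph": {"$gt": 80},
    "timestamp": {"$gte": datetime(2013, 9, 20), "$lte": datetime(2013, 9, 24)},
})
print(plates)
```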

Some of the popular NoSQL databases are MongoDB (http://www.mongodb.com/nosql), CouchDB (http://couchdb.apache.org/), Cassandra (http://cassandra.apache.org/), and HBase (http://hbase.apache.org/).

Some good references to understand big data solutions are:

2. NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, Pramod J. Sadalage and Martin Fowler, http://www.amazon.in/NoSQL-Distilled-Emerging-Polyglot-Persistence/dp/0321826620
3. A Survey on Cloud Database, Deka Ganesh Chandra
4. A comparison between several NoSQL databases with comments and notes, Bogdan George Tudorica and Cristian Bucur
5. Cache and Consistency in NOSQL, Peng Xiang, Ruichun Hou and Zhiming Zhou
6. A Storage Infrastructure for Heterogeneous and Multimedia Data in the Internet of Things, Mario Di Francesco, Na Li, Mayank Raj and Sajal K. Das, 2012 IEEE International Conference on Green Computing and Communications, Conference on Internet of Things, and Conference on Cyber, Physical and Social Computing
7. A Storage Solution for Massive IoT Data Based on NoSQL, Tingli Li, Yang Liu, Ye Tian, Shuo Shen and Wei Mao, 2012 IEEE International Conference on Green Computing and Communications, Conference on Internet of Things, and Conference on Cyber, Physical and Social Computing
8. Social-Network-Sourced Big Data Analytics, Wei Tan, M. Brian Blake, Iman Saleh and Schahram Dustdar, IEEE Internet Computing, 2013

Analytic Engine:

All the use cases of IOT indicate that analysis of the collected data is the key to making intelligent decisions and providing optimizations.
We have the data in the big data solution, but it is of not much use unless we can analyze it and find patterns to draw conclusions.
Given that the amount of data can run into petabytes, distributed computing is inevitable.
For example, let's say that from the web server logs stored in the big data solution we need to find out which keywords have been searched the most.

The logs can run into petabytes (imagine the logs Google collects about user search queries). A naïve approach to analyzing these logs is to parse them with regular expressions to capture the keywords, then count them and create aggregations.
This is quite time consuming and would take a huge amount of time on a single node. It is a classic case for distributed computing, where the data can be distributed over different nodes and the analysis run on each smaller set; at the end, the results can be collated together.

These kinds of problems fall under a paradigm called map-reduce. To ease the work of users, there are frameworks available that help users write map-reduce jobs and get their analysis done.

The Hadoop framework is one of the most popular map-reduce frameworks, providing the capability to process large data in a highly parallel fashion. It can read data from an existing database source, stores the data in its own filesystem called HDFS, and then processes the data using a map-reduce algorithm. The map-reduce job basically defines what needs to be done on the data: for example, the user may want to run a regex on each line of a log file, or an image-processing algorithm on each image stored in the database. Map-reduce makes sure the desired algorithm is applied to each unit of data in a distributed fashion.
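To make the keyword-counting example concrete, here is a minimal sketch in the Hadoop Streaming style: two small Python scripts that read from stdin and write to stdout. The log format (query string as the third field) is an assumption for illustration.

```python
#!/usr/bin/env python
# mapper.py -- emits one (keyword, 1) pair per search term in a log line.
# Assumes each log line looks like: "<timestamp> <client-ip> <query string>".
import sys

for line in sys.stdin:
    parts = line.strip().split(None, 2)
    if len(parts) < 3:
        continue                        # skip malformed lines
    for keyword in parts[2].lower().split():
        print(f"{keyword}\t1")
```

```python
#!/usr/bin/env python
# reducer.py -- Hadoop Streaming sorts mapper output by key, so counts for the
# same keyword arrive consecutively and can be summed in a single pass.
import sys

current, count = None, 0
for line in sys.stdin:
    keyword, _, n = line.strip().partition("\t")
    if keyword != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = keyword, 0
    count += int(n)
if current is not None:
    print(f"{current}\t{count}")
```

These would be submitted to the cluster with the Hadoop Streaming jar, roughly: hadoop jar hadoop-streaming.jar -input logs/ -output counts/ -mapper mapper.py -reducer reducer.py.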

Some big data solutions like MongoDB, CouchDB, and Cassandra have inbuilt map-reduce functionality, while others require a Hadoop-like framework for processing.
Details of map-reduce and Hadoop are outside the scope of this article; there are good references and books available to get started:

1. Hadoop in Action, Chuck Lam, Manning Publications, 2011
2. MapReduce Design Patterns, Donald Miner and Adam Shook, O'Reilly Publications, 2012
3. Hadoop: The Definitive Guide, Tom White, O'Reilly Publications, 3rd edition, 2012


Integration with the cloud:


Cloud integration is a fundamental requirement for IOT based frameworks.
The cloud provides three main advantages in this perspective:
1. Storage: With storage requirements running into petabytes, it is not feasible for most companies to buy their own storage solutions. The cloud provides a feasible alternative, offering storage of any order in an affordable fashion.

2. Computing: As we saw, the analysis can be heavyweight, given the amount of data to be analyzed and the algorithm to be run on each unit of data. A single node cannot run the analytics, and at the same time building a cluster is expensive for most organizations. The cloud provides on-demand computing power through multiple nodes over which the computation can be distributed.

3. SOA: The data and the analytics might be required by multiple modules and devices. The cloud provides an easy way to create services that expose the data (stored in the big data database) and the analytical results through REST calls, as sketched below.
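As a minimal sketch of the SOA idea (Flask and the endpoint path are my assumptions, not part of any standard IOT stack), a service node could expose stored analytics like this:

```python
from flask import Flask, jsonify

app = Flask(__name__)

# In a real deployment this would be read from the big data store;
# a small dict stands in here for the computed analytics.
ANALYTICS = {"sensor1": {"avg_temperature": 10.8, "samples": 86400}}

@app.route("/analytics/<sensor_id>")
def get_analytics(sensor_id):
    result = ANALYTICS.get(sensor_id)
    if result is None:
        return jsonify(error="unknown sensor"), 404
    return jsonify(sensor=sensor_id, **result)

if __name__ == "__main__":
    app.run(port=8080)   # any device or module can now GET /analytics/sensor1
```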

Conclusion

I have tried to give a high-level overview of what IOT is, how it is useful in improving processes and saving operational costs, and what technological landscape needs to be understood before creating a useful IOT based solution.
I would strongly recommend reading the references included in the various sections to get an in-depth understanding of the topic.