Monday, December 2, 2013

A Software Engineer's perspective on IOT and its related technologies

Introduction

IOT (Internet of Things), in simple terms, means connecting together everything that has an IP address. Almost every device these days can have an IP address, thanks to IPv6, and so can be connected. Devices have become intelligent and can produce data that varies from a few bits (temperature and pressure sensors) to multimedia files (traffic cameras sending live feeds and photos).

The basic idea of IOT is that devices keep transferring data to a backend server where other devices can access it, or that they communicate with each other directly to make decisions.

Today it is possible for a sensor to tweet its reading every second. The latest Mars rover, "Curiosity", has a Twitter account that posts updates on what is happening on Mars (https://twitter.com/MarsCuriosity).

Voice recorders can transfer voice data to a central server. This is commonly used in the BPO industry to record the conversations between a BPO executive and the customer.

Mobile phones, wireless sensors, RFID tags, surveillance cameras and routers are some of the common devices that generate data and send it to servers. In this article I use "sensors" and "devices" interchangeably.

Since the devices are interconnected globally, there can potentially be billions of them connected together, and with the kind of data being generated a massive amount of data needs to be collected every day.

Why is it really important to gather all the data from the sensors? Analytics on the data can reveal huge optimization opportunities in industrial processes, which can result in massive dollar savings.

GE, the industrial giant, claims that analytics on the data collected from its airplane engines can yield a 1% gain in fuel efficiency, which has the potential to save about $1 billion per year. They call it the Power of 1. The savings come from the fact that, with the data analytics, they can provide proactive rather than reactive care to the engines. Servicing can be done on a need basis rather than on a fixed schedule, which keeps the parts running for as long as they are healthy. This also means no unplanned downtime: the service happens when the part is about to wear out, not after it has stopped working. The data from the engine will create a service request if the analytics finds that a part requires servicing. (http://www.ge.com/docs/chapters/Industrial_Internet.pdf)

This is mind-boggling. You might ask whether it is correct or just a marketing gimmick. In fact, fuel savings have a massive effect on the cost of operations. A very rough estimate suggests that if the people of a small suburb of Bangalore used public transport to go to work instead of their personal vehicles, it could save up to $50,000 worth of fuel per day. (http://rohitagarwal24.blogspot.in/2013/10/fun-with-number-save-fuel-save-dollars.html)

Some popular use cases that show the power of IOT

  1. Insurance premiums can be customized to the actual risk of operating a vehicle rather than being based on proxies such as the driver's age, gender, or place of residence.
  2. In retailing, sensors embedded in the membership card can build the shopper's profile from the shopping list. This data can be sent to the server. The next time the same shopper comes back, he/she can be given offers at the point of sale.
  3. Airplane and rail engines can send continuous data about their wear and tear to a central computer, allowing proactive maintenance rather than scheduled maintenance and reducing unplanned downtime. It is estimated that 22 billion dollars is wasted annually by commercial airlines due to flight delays and unexpected fuel consumption. This further strengthens the notion that analytics might play a big role in reducing this wastage.
  4. A sensor network based on video feeds, audio and vibration detectors can spot unauthorized individuals. This increases the ability of security personnel to detect trespassers.


The one thing to note in all these use cases is the analytics. Based on the data sent out by the device, the application needs to run analytics and then take some actions or decisions.

I was amazed to learn that wind turbines these days can connect to other wind turbines to check the wind speed at their location and try to balance the wind power among each other, thereby optimizing the use of wind for energy generation. This is also called "Machine to Machine Communication" (http://en.wikipedia.org/wiki/Machine_to_machine).

Issues with IOT

Before getting hooked on a new technology it is very important to know the problems that come along with it.
There are three important bottlenecks when working with sensors, or so-called things: storage, battery (power) and connectivity.

Storage Problem

A device cannot store all of its sensor data locally, as it does not have that kind of storage. For example, a traffic camera can generate thousands of megapixel images, and it is not possible to store everything in the camera. Also, if the camera is damaged or stolen, the data is lost. So the common approach is to transfer the data to a backend database where it gets stored.
But storing the data on a backend server is not as simple as it looks.

The data is clearly heterogeneous in nature (images, video, sensor readings). It is also mostly unstructured, like videos and log files.
A relational database is a poor fit for storing it, as relational databases do not handle unstructured data very efficiently.

All you can do is save it as a blob, which means you cannot query the data inside the blob, and hence you cannot retrieve it for analysis. There is also a practical limit to the amount of data that can be placed in an RDBMS. Imagine you have collected video streams from the traffic cameras across a city like Delhi. We can safely assume there will be at least 100 traffic cameras. Assume that each camera sends one image per second. If an image is, say, 500KB, then you have 100*500KB=50MB of data per second. Multiply by the 24*3600 seconds in a day and you get 50MB*3600*24=4320GB, or about 4.3TB of data per day! This is a huge amount of data to store on a single computer. The storage disk will soon run out. Querying such a massive data set will also take forever with a regular SQL solution. The obvious choice is data federation over a cluster, but a regular RDBMS solution does not scale well on clusters.
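
The back-of-the-envelope estimate above can be reproduced in a few lines of Python. This is only a sketch: the camera count, image size and frame rate are the assumed figures from the paragraph, and the use of decimal units (1 GB = 1000 MB) is an assumption of the sketch.

```python
# Rough daily storage estimate for a city's traffic cameras.
# All inputs are the illustrative assumptions from the text, not measurements.
CAMERAS = 100
IMAGE_SIZE_KB = 500
IMAGES_PER_SECOND = 1
SECONDS_PER_DAY = 24 * 3600


def daily_volume_kb(cameras, image_kb, images_per_second):
    """Return the total data volume produced in one day, in KB."""
    return cameras * image_kb * images_per_second * SECONDS_PER_DAY


total_kb = daily_volume_kb(CAMERAS, IMAGE_SIZE_KB, IMAGES_PER_SECOND)
total_gb = total_kb / 1000 / 1000  # decimal units: 1 GB = 1000 MB = 1,000,000 KB
total_tb = total_gb / 1000

print(f"{total_gb:.0f} GB per day, about {total_tb:.2f} TB")
```

Changing any of the three inputs shows how quickly the volume grows, for example with more cameras or a higher frame rate.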

NoSQL databases come to the rescue for the problems of heterogeneity and massive data. We will discuss them in a later section.

Battery Issue

One important consideration is the energy or battery consumption of the devices. Most remote sensors run on battery power (they might not be connected to a constant power supply). If they transfer data constantly, the battery might soon drain and you will stop getting data from the sensors. So energy management has to be in place.

One approach is to send bursts of data to the central server at specific intervals instead of sending data constantly. This saves the battery cost of remaining connected.
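
The burst approach can be sketched as a small buffer that collects readings and transmits them together once an interval has elapsed. This is a minimal illustration, not a real device driver: the class name `BurstSender`, the `send` callback and the injectable `clock` are all hypothetical names chosen for this sketch.

```python
import time


class BurstSender:
    """Buffer readings locally and send them in one burst per interval."""

    def __init__(self, send, interval_seconds, clock=time.monotonic):
        self._send = send                  # callable that transmits a list of readings
        self._interval = interval_seconds  # minimum time between two bursts
        self._clock = clock                # injectable clock, handy for simulation
        self._buffer = []
        self._last_flush = clock()

    def record(self, reading):
        """Store a reading; transmit the buffer if the interval has elapsed."""
        self._buffer.append(reading)
        if self._clock() - self._last_flush >= self._interval:
            self.flush()

    def flush(self):
        """Send everything buffered so far in a single transmission."""
        if self._buffer:
            self._send(list(self._buffer))
            self._buffer.clear()
        self._last_flush = self._clock()


# Simulate two minutes of one reading per second with a fake clock.
sent = []
fake_time = [0.0]
sender = BurstSender(sent.append, interval_seconds=60, clock=lambda: fake_time[0])
for second in range(1, 121):
    fake_time[0] = float(second)
    sender.record(second)

print(len(sent), "bursts instead of 120 separate transmissions")
```

The radio only needs to be switched on during `flush`, which is where the battery saving would come from on a real device.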

But the burst approach is not always feasible. In scenarios like traffic cameras, where real-time data has to be transmitted from the camera to the control room, the sensors are always online and in transmission mode.

The same holds for intelligent machines like wind turbines, which control themselves based on wind speed and rain, transmit data to a central computer and get the analytics back.

The sensors in a plant are likewise in constant transmission mode. These are all active sensors, and they need to be always connected, whereas passive sensors can defer the connection until it is needed. For active sensors a constant source of power is unavoidable.

Connectivity

Another critical problem is unreliable or limited internet connectivity. Sensors are sometimes deployed in remote locations that have very limited or no internet connectivity. One approach to get data from these sensors is to use a temporary connection once a day or once a week, depending on the criticality of the data and the amount of storage at the sensor. This connects the sensor to the network so that the data can be transferred. One of the techniques being tried is Google's Project Loon, which uses balloons to provide connectivity (http://en.wikipedia.org/wiki/Project_Loon).
There have also been attempts where a vehicle was used to connect the nodes (a data mule), as described in "Data MULEs: Modeling a Three-tier Architecture for Sparse Sensor Networks".

This approach exploits the presence of mobile entities (called MULEs) in the environment. MULEs pick up data from the sensors when in close range, buffer it, and drop it off at wired access points. This addresses the connectivity issue and at the same time can lead to substantial power savings at the sensors, as they only have to transmit over a short range (just to the vehicle).
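
The three tiers (sensor, mule, access point) can be illustrated with a toy simulation. This is only a sketch of the idea, not the model from the paper: the `Sensor` and `Mule` classes and their methods are hypothetical, and radio range, timing and buffer limits are ignored.

```python
class Sensor:
    """A passive sensor that buffers readings until a mule passes by."""

    def __init__(self, name):
        self.name = name
        self.buffer = []

    def record(self, value):
        self.buffer.append(value)

    def hand_over(self):
        """Give all buffered readings to a nearby mule and clear local storage."""
        data, self.buffer = self.buffer, []
        return data


class Mule:
    """A mobile entity that carries data from sensors to a wired access point."""

    def __init__(self):
        self.cargo = []

    def visit(self, sensor):
        """Pick up everything the sensor has buffered (short-range transfer)."""
        self.cargo.extend((sensor.name, value) for value in sensor.hand_over())

    def drop_off(self, access_point):
        """Deliver the carried data to the access point and empty the cargo."""
        access_point.extend(self.cargo)
        self.cargo = []


access_point = []  # stands in for the wired access point
s1, s2 = Sensor("s1"), Sensor("s2")
s1.record(10.5)
s1.record(10.7)
s2.record(3)

mule = Mule()
mule.visit(s1)
mule.visit(s2)
mule.drop_off(access_point)

print(access_point)
```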

 

The data mule model is better suited to passive sensors. For active sensors like traffic cameras, connectivity has to be available all the time.

Understanding a technical framework for employing an IOT based solution

An IOT based solution is one where you are able to analyze the data coming from the machines (devices) to predict something useful for the existing workflow.
Based on my understanding of IOT and its usage, there is a technical infrastructure that needs to be put together. I have tried to draw a rough diagram of what can be called an IOT stack.



Devices/Sensors:

These are the physical sensors/devices that form the core of IOT. It is assumed that the sensor or thing is able to gather some data and that the data can be sent to a central backend server for detailed analysis. Some common IOT devices are temperature/pressure sensors, routers and cameras.

Communication Protocol:

Since the devices are constrained by battery power, bandwidth and memory, the regular TCP/IP stack is too heavy for practical use. The main problem with the standard communication protocols is the overhead of sending large headers in every packet. They were designed for PCs and workstations, where memory and bandwidth are not a major issue. In a TCP packet over IPv6 the headers alone take at least 60 bytes (40 for IPv6 and 20 for TCP), while the real data from a sensor may be only a few bytes. This is not an optimized solution, as the sensors will burn out the battery sending heavy headers rather than actual data. For this reason new protocol standards have been proposed, like 6LoWPAN (http://en.wikipedia.org/wiki/6LoWPAN), which compress the headers and so have less overhead.
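
To get a feel for the overhead, here is a small calculation of how much of each packet is real data. The 40-byte IPv6 header and the 20-byte minimum TCP header are the standard sizes; the 4-byte reading and the 10-byte "compressed" header are assumptions made up for this illustration and are not figures from the 6LoWPAN specification.

```python
IPV6_HEADER_BYTES = 40  # fixed IPv6 header size
TCP_HEADER_BYTES = 20   # minimum TCP header size (no options)


def payload_share(payload_bytes, header_bytes):
    """Fraction of the bytes on the wire that are actual sensor data."""
    return payload_bytes / (payload_bytes + header_bytes)


reading = 4  # assumed: one 4-byte temperature reading per packet

full = payload_share(reading, IPV6_HEADER_BYTES + TCP_HEADER_BYTES)

# Hypothetical compressed header size, for comparison only.
assumed_compressed_header = 10
compressed = payload_share(reading, assumed_compressed_header)

print(f"full headers:       {full:.4f} of each packet is data")
print(f"compressed headers: {compressed:.4f} of each packet is data")
```

With full headers only one byte in sixteen is sensor data in this example, which is why header compression matters so much for battery-powered devices.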

BigData based storage:

We discussed earlier that the sensor/IOT data is huge and heterogeneous in nature. In this section I will try to elaborate on this aspect.
For example, we can have sensors producing data in the following formats:
Sensor1Reading => 10.5,3 (temperature, pressure)
Sensor2Reading => img
Sensor3Reading => Log

An RDBMS is all about schemas and tables. No single schema can satisfy these 3 different kinds of data.
One way is to store each reading as a blob, so the table would look like this:
Sensor1, blob1
Sensor2, blob2
Sensor3, blob3

But the problem is that we cannot write a query such as "give me the average temperature recorded by sensor1", because blobs are treated as binary data and the SQL query will not look into the content of the blob. One of the key reasons for using a database is to be able to query the data efficiently, which is obviously not satisfied in this case.

In this situation the big data or NoSQL based solutions come to the rescue.
In very simple terms, NoSQL means a schema-less database, where the storage is mostly based on key-value pairs. The keys and values can be anything. So the above data could be stored as:
Sensor1Reading => 10.5,3 (temperature, pressure)
Sensor2Reading => img
Sensor3Reading => Log
Many big data solutions allow the user to query inside the values of the keys, which resolves the querying limitation of blobs.
Another important advantage of big data solutions is the ease of changing the data set or the schema stored. For example, if in future you want to add a location parameter to the sensor1 value, it can be done easily. Just start adding the geolocation to the value while storing it, so the entries will look like:
Sensor1Reading => 10.5,3,loc1 (temperature, pressure, locationdata)
Sensor2Reading => img
Sensor3Reading => Log
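
The idea of schema-less records that can still be queried can be sketched with plain Python dictionaries. This is only an in-memory stand-in and not the API of any particular NoSQL database; the field names and values are made up for illustration.

```python
from statistics import mean

# A tiny in-memory stand-in for a schema-less store: each record is a dict,
# and records for different sensors are free to carry different fields.
readings = [
    {"sensor": "sensor1", "temperature": 10.5, "pressure": 3},
    # A later record adds a location field without any schema change.
    {"sensor": "sensor1", "temperature": 11.5, "pressure": 3, "location": "loc1"},
    {"sensor": "sensor2", "image": b"placeholder image bytes"},
    {"sensor": "sensor3", "log": "placeholder log line"},
]


def average(records, sensor, field):
    """Average a numeric field over all records of one sensor that have it."""
    values = [r[field] for r in records if r["sensor"] == sensor and field in r]
    return mean(values) if values else None


print(average(readings, "sensor1", "temperature"))
```

Unlike the blob approach, the query can look inside each value, and the record that carries the extra `location` field sits next to the older ones without any migration.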

Imagine the same thing with an RDBMS. You would have to add a new column to the table and then check that no foreign key constraints are violated. The problem explodes if you have to keep adding new fields.

Query time is another major reason for not considering an RDBMS solution. Without a suitable index, an RDBMS has to search the table linearly. So if the table has a million records and the data of interest is in the millionth row, the search takes roughly a million times the time needed to read one record. Disk I/O becomes the bottleneck and the retrieval takes a huge amount of time.
The main problem is that a traditional RDBMS does not parallelize query execution efficiently, and it is generally hard to run on a cluster. Big data solutions, on the other hand, fundamentally support cluster based data federation. A query is executed in parallel on multiple nodes and the search time is drastically reduced.

Basically, you need to answer the following questions before choosing the right database solution:
  1. How is the data defined (schema)? Can the database support unstructured data?
  2. What is the limit of data storage? How does the data scale? Can I add more disks and distribute the data?
  3. How easy is it to change the schema, given that I might add a new sensor that produces a different kind of data?
  4. How can the data be queried accurately and efficiently (query model)? Without the ability to query the data, it is no more than a set of file folders. Also, how much time does a query take? Some interesting queries are:

· Query the plate numbers of all the cars that were speeding above 80 km/h between 20-09-2013 and 24-09-2013 in the Connaught Place area of New Delhi.
  o Here we are interested in the traffic surveillance records coming from cameras and speed sensors. Image processing is involved, since we are interested in the number plate.
· Query the flight path of flight number 123 on 20-10-2008.
  o Since the data is 5 years old, the search can take some time. This is a challenge that big data solutions must resolve before even being considered as a potential solution.
· Query the location of the 10 mobile phones nearest to the GPS position (22,77).
  o This needs a lot of distance calculations between thousands of mobile phones in that region, and yet the result must come back in a few seconds.
· Find out my Facebook friends.
  o The query looks simple, but your name could be one of 200 million in the database. A sequential scan in an RDBMS would take from a few minutes to hours just to find your name, and then performing the join would bring down the Facebook server.
· Find out if a person has already completed the scans for the Aadhaar card.
  o This has to match iris scans and fingerprint scans. Even if only 1 crore (10 million) people have been registered, imagine the computing power required. And, surprisingly, the matches are verified in less than a minute.
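
As an illustration of the "nearest 10 mobile phones" query, here is a brute-force sketch that computes the great-circle (haversine) distance to every phone and keeps the closest ones. The phone records and the function names are hypothetical. A real system would probably use a geospatial index instead of scanning every phone, but the sketch shows the amount of distance computation involved.

```python
import heapq
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (latitude, longitude) points in km."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_phones(phones, lat, lon, k=10):
    """Return the ids of the k phones closest to (lat, lon), nearest first."""
    closest = heapq.nsmallest(
        k, phones, key=lambda p: haversine_km(lat, lon, p["lat"], p["lon"])
    )
    return [p["id"] for p in closest]


# Made-up phone positions around the GPS position (22, 77).
phones = [
    {"id": "A", "lat": 22.10, "lon": 77.00},
    {"id": "B", "lat": 22.01, "lon": 77.01},
    {"id": "C", "lat": 25.00, "lon": 80.00},
]

print(nearest_phones(phones, 22.0, 77.0, k=2))
```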

Some of the popular NoSQL databases are MongoDB (http://www.mongodb.com/nosql), CouchDB (http://couchdb.apache.org/), Cassandra (http://cassandra.apache.org/) and HBase (http://hbase.apache.org/).

Some good references to understand big data solutions are:

2. NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, Pramod J. Sadalage, Martin Fowler, http://www.amazon.in/NoSQL-Distilled-Emerging-Polyglot-Persistence/dp/0321826620
3. A Survey on Cloud Database, Deka Ganesh Chandra
4. A comparison between several NoSQL databases with comments and notes, Bogdan George Tudorica, Cristian Bucur
5. Cache and Consistency in NOSQL, Peng Xiang, Ruichun Hou and Zhiming Zhou
6. A Storage Infrastructure for Heterogeneous and Multimedia Data in the Internet of Things, Mario Di Francesco, Na Li, Mayank Raj, Sajal K. Das, 2012 IEEE International Conference on Green Computing and Communications, Conference on Internet of Things, and Conference on Cyber, Physical and Social Computing
7. A Storage Solution for Massive IoT Data Based on NoSQL, Tingli Li, Yang Liu, Ye Tian, Shuo Shen, Wei Mao, 2012 IEEE International Conference on Green Computing and Communications, Conference on Internet of Things, and Conference on Cyber, Physical and Social Computing
8. Social-Network-Sourced Big Data Analytics, Wei Tan, M. Brian Blake, Iman Saleh, Schahram Dustdar, IEEE Internet Computing, 2013

Analytic Engine:

All the use cases of IOT indicate that analysis of the collected data is the key to making intelligent decisions and providing optimizations.
We have the data in the big data solution, but it is not of much use unless we are able to analyze it and find patterns from which to draw conclusions.
Given that the amount of data can run into petabytes, distributed computing is unavoidable.
For example, let's say that from the web server logs stored in the big data solution we need to find out which keywords have been searched the most.

The logs can run into petabytes (imagine the logs that Google collects about user search queries). A naïve approach to analyzing these logs is to parse them with regular expressions that capture the keywords, then count the keywords and build aggregations.
This is quite time consuming and will take a huge amount of time if run on a single node. This is a classic case for distributed computing, where the data is distributed over different nodes and the analysis is run on each smaller set. At the end the results are collated together.

This kind of problem falls under a paradigm called map-reduce. To ease the work of users, there are frameworks that help them write map-reduce jobs and get their analysis done.

The Hadoop framework is one of the most popular map-reduce frameworks and provides the capability to process large data in a highly parallel fashion. It can read the data from an existing database source, store it in its own filesystem called HDFS and then process it using a map-reduce algorithm. The map-reduce algorithm basically defines what needs to be done on the data. For example, the user may want to run a regex on each line of a log file, or an image processing algorithm on each image file stored in the database. Map-reduce makes sure that the desired algorithm is applied to each unit of data in a distributed fashion.
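
The keyword-counting example can be sketched in plain Python to show the shape of a map-reduce job. This is not Hadoop code: it runs on a single machine, the log format (`q=<keyword>`) is an assumption made for the sketch, and the function names are hypothetical. On a real cluster each chunk would be processed on a different node.

```python
import re
from collections import Counter
from functools import reduce

KEYWORD = re.compile(r"q=(\w+)")  # assumed log format: "... q=<keyword> ..."


def map_chunk(lines):
    """Map step: count the keywords found in one chunk of log lines."""
    return Counter(
        match.group(1).lower()
        for line in lines
        for match in KEYWORD.finditer(line)
    )


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def top_keywords(chunks, n=3):
    """Run map on every chunk, reduce the partial results, return the top n."""
    partials = map(map_chunk, chunks)  # on a cluster: one chunk per node
    total = reduce(reduce_counts, partials, Counter())
    return total.most_common(n)


# Two made-up chunks of a web server log.
chunks = [
    ["GET /search?q=iot", "GET /search?q=hadoop"],
    ["GET /search?q=IoT", "GET /search?q=nosql", "GET /search?q=iot"],
]

print(top_keywords(chunks, n=1))
```

Because each chunk is counted independently and the partial counts are simply added, the map step can be spread over as many nodes as there are chunks.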

Some big data solutions like MongoDB and CouchDB have a built-in map-reduce capability, while others such as Cassandra and HBase are typically paired with a Hadoop kind of framework for processing.
Details about map-reduce and Hadoop are outside the scope of this article. There are good references and books available to get started.

1. Hadoop in Action, Chuck Lam, Manning Publications, 2011
2. MapReduce Design Patterns, Donald Miner & Adam Shook, O'Reilly Publications, 2012
3. Hadoop: The Definitive Guide, Tom White, O'Reilly Publications, 3rd edition, 2012


Integration with the cloud:


Cloud integration is a fundamental requirement for IOT based frameworks.
The cloud provides 3 main advantages in this respect:
1. Storage: With storage requirements running into petabytes, it is not feasible for most companies to buy their own storage solutions. The cloud offers a feasible alternative by providing storage of any size at an affordable price.

2. Computing: As we saw, the analysis can be heavyweight, given the amount of data to be analyzed and the algorithm to be run on each unit of data. A single node cannot run the analytics. At the same time, building a cluster is expensive for most organizations. The cloud provides on-demand computing power in the form of multiple nodes over which the computing can be distributed.

3. SOA: The data and the analytics might be required by multiple modules and devices. The cloud provides an easy way to create services that expose the data (stored in the big data database) and the analytical results through REST calls.
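
To make the REST idea concrete, here is a minimal sketch of a service that returns analytics results as JSON, using only Python's standard library. Everything in it is an assumption made for illustration: the `/sensors/<name>` path, the `RESULTS` dictionary standing in for the data store, and the port number. A production service would sit behind a proper web framework and read from the actual database.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical analytics results; a real service would read from the data store.
RESULTS = {"sensor1": {"average_temperature": 11.0}}


def handle_get(path):
    """Map a request path such as /sensors/sensor1 to (status code, JSON body)."""
    prefix = "/sensors/"
    if path.startswith(prefix) and path[len(prefix):] in RESULTS:
        return 200, json.dumps(RESULTS[path[len(prefix):]])
    return 404, json.dumps({"error": "not found"})


class AnalyticsHandler(BaseHTTPRequestHandler):
    """HTTP handler that answers GET requests with the JSON from handle_get."""

    def do_GET(self):
        status, body = handle_get(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    """Start the service on localhost (assumed port); blocks until interrupted."""
    HTTPServer(("localhost", port), AnalyticsHandler).serve_forever()


# The routing logic can be exercised without starting a server.
print(handle_get("/sensors/sensor1"))
```

Calling `serve()` would start the service, after which any device or module could fetch `/sensors/sensor1` over HTTP.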

Conclusion

I have tried to give a high-level overview of what IOT is, how it is useful in improving processes and saving operational costs, and what technological landscape needs to be understood before creating a useful solution based on IOT.
I would strongly recommend reading the references included in the various sections to get an in-depth understanding of the topic.