IOT (Internet of Things), in simple terms, means connecting together everything that has an IP address. Thanks to IPv6, nearly every device these days can have an IP and so can be connected to the network. Devices have become intelligent and can produce data ranging from a few bytes (temperature and pressure sensors) to multimedia files (traffic cameras sending live feeds and photos).
The basic idea of IOT is that devices keep transferring data to a backend server where other devices can access it, or that they communicate with each other directly to make decisions.
Today it is possible for a sensor to tweet its reading every second. The latest Mars rover, "Curiosity", can post to Twitter and update what is happening on Mars (https://twitter.com/MarsCuriosity).
Voice recorders can transfer voice data to a central server; this is commonly used in the BPO industry to record the conversations between a BPO executive and the customer.
Mobiles, wireless sensors, RFID tags, surveillance cameras, and routers are some of the common devices generating data and sending it to servers. In this article I have used the terms sensors and devices interchangeably.
Since the devices are interconnected globally, there can potentially be billions of devices connected together, and with the kind of data being generated, massive volumes need to be collected every day.
Why is it really important to gather all the data from the sensors? Analytics on this data can uncover huge optimization opportunities in industrial processes, which can result in massive dollar savings.
GE, the industrial giant, claims that performing analytics on the data collected from its airplane engines could yield a 1% fuel efficiency improvement, which has the potential to save about $1 billion per year. They call this the Power of 1. The savings come from the fact that with data analytics they can provide proactive care to the engines rather than reactive care. Servicing can happen on a need basis rather than on a fixed schedule, which keeps the parts moving as required. This also means no unplanned downtime: the service happens when the part is about to wear out, not after it has stopped working. The data from the engine will create a service request if the analytics finds that the part requires servicing. (http://www.ge.com/docs/chapters/Industrial_Internet.pdf)
This is mind boggling. You might ask whether this is correct or just a marketing gimmick. Fuel saving really does have a massive effect on the cost of operations. A very rough estimate suggests that if the people of a small suburb of Bangalore used public transport to get to work instead of their personal vehicles, it could save up to $50,000 worth of fuel per day. (http://rohitagarwal24.blogspot.in/2013/10/fun-with-number-save-fuel-save-dollars.html)
Some popular use cases for appreciating the power of IOT:
- Insurance premiums can be customized to the actual risks of operating a vehicle rather than based on proxies such as the driver's age, gender, or place of residence
- In retail, sensors embedded in a membership card can capture a shopper's profile based on their shopping list. This data can be sent to the server, and the next time the same shopper comes back, targeted offers can be made at the point of sale
- Airplane and rail engines can send continuous data about their wear and tear to a central computer, allowing for proactive maintenance rather than scheduled maintenance and reducing unplanned downtime. It is estimated that $22 billion is wasted annually by commercial airlines due to flight delays and unexpected fuel consumption, which further strengthens the notion that analytics can play a big role in reducing this waste
- Sensor networks based on video feeds, audio, and vibration detectors can spot unauthorized individuals, increasing the ability of security personnel to detect trespassers
Some more use cases have been listed at http://www.libelium.com/top_50_iot_sensor_applications_ranking/
In all of these use cases the common thread is analytics: based on the data sent out by the devices, the application needs to build analytics and then take actions or decisions.
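To make this sense-analyze-act loop concrete, here is a toy sketch in Python. The wear threshold, the reading source, and the service-request helper are all hypothetical placeholders, not any real engine API:

```python
import random

WEAR_THRESHOLD = 0.8  # hypothetical: service once 80% of rated life is consumed

def read_wear_metric(device_id):
    # Placeholder: in a real system this would come from the device's telemetry.
    return random.random()

def create_service_request(device_id, wear):
    # Placeholder: in a real system this would open a ticket in a maintenance system.
    print(f"Service request for device {device_id}: wear at {wear:.0%}")

def check_device(device_id):
    wear = read_wear_metric(device_id)
    if wear >= WEAR_THRESHOLD:  # proactive: act before the part fails
        create_service_request(device_id, wear)

for device in ["engine-1", "engine-2", "engine-3"]:
    check_device(device)
```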
I was amazed to learn that wind turbines these days can connect to one another to check the wind speed at another turbine's location and balance the generation load among themselves, thereby optimizing the use of wind for energy generation. This is also called "Machine to Machine Communication" (http://en.wikipedia.org/wiki/Machine_to_machine).
Before getting hooked on this new technology, it is very important to know the problems that come along with it.
There are three important bottlenecks when working with sensors, or so-called things: storage, battery (power), and connectivity.
Storage Problem
It is not possible for a device to store all its sensor data locally, as devices simply don't have that kind of storage. For example, a traffic camera can generate thousands of megapixel images; it is not possible to store everything in the camera. And if the camera is damaged or stolen, the data is lost. So the common approach is to transfer the data to a backend database where it gets stored.
But storing the data on a backend server is not as simple as it looks. The data is clearly heterogeneous in nature (images, video, sensor readings), and much of it is unstructured, like videos and log files. We cannot use a relational database to store it, because relational databases do not handle unstructured data efficiently.
All you can do is save it as a blob, which means you cannot query the data inside the blob, and hence you cannot retrieve it for analysis. There is also a limit to the amount of data that can be placed in an RDBMS. Imagine you have collected video streams from traffic cameras across a city like Delhi. We can safely assume there will be at least 100 traffic cameras. Assume each camera sends one image per second. If each image is, say, 500KB, then you have 100*500KB = 50MB of data per second. Multiply by the 24*3600 seconds in a day and you get 50MB*3600*24 = 4320GB, or about 4.3TB of data per day! This is far too much data to store on a single computer; the storage disk will soon run out. Querying such massive data will also take forever with a regular SQL solution. The obvious choice is data federation over a cluster, but regular RDBMS solutions do not scale well across clusters.
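The arithmetic is easy to verify with a quick back-of-envelope script, using the same illustrative assumptions (camera count, image size, frame rate) as above:

```python
# Back-of-envelope estimate of daily data volume from city traffic cameras.
cameras = 100           # assumed number of traffic cameras in the city
image_kb = 500          # assumed size of one image, in KB
seconds_per_day = 3600 * 24

kb_per_second = cameras * image_kb           # 50,000 KB = 50 MB every second
kb_per_day = kb_per_second * seconds_per_day
gb_per_day = kb_per_day / 1e6                # decimal units: 1 GB = 10^6 KB

print(f"{kb_per_second / 1e3:.0f} MB per second")
print(f"{gb_per_day:.0f} GB (~{gb_per_day / 1e3:.1f} TB) per day")
```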
To cater to the problems of heterogeneity and massive volume, NoSQL databases come to the rescue. We will discuss them in a later section.
Battery Issue
Another important consideration is the energy or battery consumption of the devices. Most remote sensors run on battery power (they may not be connected to a constant power supply). If they are constantly transmitting data, their batteries will soon drain and you will stop getting data from them. So energy management has to be in place.
One approach is to send bursts of data to the central server at specific intervals instead of transmitting constantly. This helps save the battery cost of remaining connected.
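A minimal sketch of this duty-cycling idea: buffer readings locally and transmit them in one burst at a fixed interval. Here read_sensor() and transmit() are hypothetical stand-ins for the device's actual sampling and radio code:

```python
import time

BURST_INTERVAL = 60   # assumed: wake the radio once per minute
SAMPLE_PERIOD = 1     # assumed: sample the sensor once per second

def read_sensor():
    # Placeholder for the device's actual sampling code.
    return 25.0

def transmit(batch):
    # Placeholder for the radio transmission; on a real device the radio
    # would be powered up only for this call and powered down right after.
    print(f"sending {len(batch)} buffered readings")

buffer = []
last_burst = time.monotonic()
while True:
    buffer.append(read_sensor())
    if time.monotonic() - last_burst >= BURST_INTERVAL:
        transmit(buffer)       # one radio wake-up covers many readings
        buffer.clear()
        last_burst = time.monotonic()
    time.sleep(SAMPLE_PERIOD)
```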
But the burst approach is not feasible all the time. In some scenarios, such as traffic cameras where real-time data has to be transmitted from the camera to a control room, the sensors are always online and in transmission mode. The same goes for intelligent machines like wind turbines, which control themselves based on wind speed and rain while transmitting data to a central computer and getting analytics back, or for sensors in a plant that are in constant transmission mode. These are all active sensors and need to be always connected, whereas passive sensors can defer the connection to a need basis. For the active scenarios, a constant source of power is essential.
Connectivity
Another critical problem is unreliable or limited internet connectivity. There are scenarios where sensors are deployed in remote locations that have very limited or no internet connectivity. One approach to getting data from these sensors is a temporary connection once a day or week, depending on the criticality of the data and the amount of storage at the sensor; this gets the sensor connected to the network so the data can be transferred. Among the techniques being tried is Project Loon, where Google uses balloons to provide connectivity (http://en.wikipedia.org/wiki/Project_Loon).
There have also been attempts where a vehicle was used to connect the nodes, a so-called data mule (see "Data MULEs: Modeling a Three-tier Architecture for Sparse Sensor Networks"). This approach exploits mobile entities (called MULEs) present in the environment: MULEs pick up data from the sensors when in close range, buffer it, and drop it off at wired access points. This addresses the connectivity issue, and at the same time it can lead to substantial power savings at the sensors, since they only have to transmit over a short range (just to the vehicle).
The data mule model is better suited to passive sensors; for active sensors like traffic cameras, connectivity has to be available all the time.
An IOT based solution is one where you are able to analyze the data coming from the machines (devices) to predict something useful for the existing workflow.
Based on my understanding of IOT and its usage, there is a technical infrastructure that needs to be put together. I have tried to draw a rough diagram of what can be called an IOT stack.
Devices/Sensors:
These are the physical sensors and devices that form the core of IOT. It is assumed that the sensor or thing is able to gather some data and that this data can be sent to some central backend server for detailed analysis. Some common IOT devices are temperature/pressure sensors, routers, and cameras.
Communication Protocol:
Since the devices are constrained by battery power, bandwidth availability, and memory, the regular TCP/IP protocol stack is too bulky for practical use. The main problem with the standard communication protocols is the overhead of sending large headers in each packet. They were designed for PCs and workstations, where memory and bandwidth are not major issues. In a TCP packet over IPv6, a large number of bytes go to header information while the real data is very small. This is not an optimized solution, as the sensors will burn out their batteries sending heavy headers rather than actual data. For this reason, new protocol standards with smaller overheads have been proposed, such as 6LoWPAN (http://en.wikipedia.org/wiki/6LoWPAN).
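The overhead is easy to quantify. An IPv6 header is a fixed 40 bytes and a minimal TCP header is another 20 bytes, so a tiny sensor reading is dwarfed by its own framing:

```python
# Ratio of header bytes to payload bytes for a small sensor reading
# sent over TCP/IPv6 (link-layer framing ignored for simplicity).
IPV6_HEADER = 40   # fixed IPv6 header size, in bytes
TCP_HEADER = 20    # minimum TCP header size, in bytes
payload = 8        # e.g. two 32-bit sensor readings

total = IPV6_HEADER + TCP_HEADER + payload
overhead = total - payload
print(f"{overhead} of {total} bytes ({overhead / total:.0%}) are protocol overhead")
```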
BigData based storage:
We discussed earlier that sensor/IOT data is huge and heterogeneous in nature. In this section I will try to elaborate on this aspect.
For example, we can have sensors producing data in the following formats:
Sensor1Reading => 10.5,3 (temperature, pressure)
Sensor2Reading => img
Sensor3Reading => Log
In the case of an RDBMS, it is all about schemas and tables, and no single schema can satisfy these three different kinds of data. One way is to store each reading as a blob, so the schema would be:
Sensor1, blob1
Sensor2, blob2
Sensor3, blob3
But the problem is that we cannot write a query that says "give me the average temperature recorded by sensor1": blobs are treated as opaque binary data, and a SQL query cannot look inside the content of a blob. One of the key reasons for using a database is being able to query the data efficiently, which is obviously not satisfied in this case.
In this situation, big data or NoSQL based solutions come to the rescue.
In very simple terms, NoSQL means a schemaless database, where storage is mostly on the basis of key-value pairs. The keys and values can be anything, so the above data could be stored as:
Sensor1Reading => 10.5,3 (temperature, pressure)
Sensor2Reading => img
Sensor3Reading => Log
NoSQL databases allow the user to query within the values of the keys, which resolves the querying limitation of blobs.
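As a concrete illustration, here is a minimal sketch using MongoDB through the pymongo driver. It assumes a MongoDB server running on localhost, and the database, collection, and field names are made up for the example:

```python
from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
readings = client.iot.readings  # database "iot", collection "readings"

# Heterogeneous documents, no fixed schema: each sensor stores what it has.
readings.insert_many([
    {"sensor": "sensor1", "temperature": 10.5, "pressure": 3,
     "ts": datetime(2013, 9, 20)},
    {"sensor": "sensor2", "image": b"...binary image bytes..."},
    {"sensor": "sensor3", "log": "GET /index.html 200"},
])

# Unlike a blob in an RDBMS, the stored values are queryable:
pipeline = [
    {"$match": {"sensor": "sensor1"}},
    {"$group": {"_id": "$sensor", "avgTemp": {"$avg": "$temperature"}}},
]
for row in readings.aggregate(pipeline):
    print(row)  # e.g. {'_id': 'sensor1', 'avgTemp': 10.5}
```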
Another important advantage of big data solutions is the ease of changing the stored data set or schema. For example, if in the future you want to add a location parameter to the sensor1 value, it can be done easily: just start including the geolocation in the value when storing it, so the entries look like:
Sensor1Reading => 10.5,3,loc1 (temperature, pressure, location data)
Sensor2Reading => img
Sensor3Reading => Log
Imagine the same thing with an RDBMS: you would have to add a new column to the table and then check that no foreign key constraints are violated. This problem explodes if you have to keep adding new fields.
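In code the schema change is literally a non-event: new documents simply carry the extra field, and old documents are untouched. Continuing the hypothetical pymongo example from above:

```python
from pymongo import MongoClient

readings = MongoClient("mongodb://localhost:27017").iot.readings

# Adding a location to new sensor1 readings requires no migration at all:
readings.insert_one({
    "sensor": "sensor1", "temperature": 11.2, "pressure": 3,
    "location": "loc1",  # new field, carried only by new documents
})
# Older documents without "location" still match the same queries; an RDBMS
# would have needed an ALTER TABLE plus constraint checks at this point.
```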
Query time is another major reason for not choosing an RDBMS solution. In an RDBMS, a search (absent a suitable index) happens in a linear fashion: if it is a million-record table and your row of interest is the millionth one, the query takes a million times the per-record seek time. Disk I/O becomes the bottleneck, and retrieval takes a huge amount of time.
The main problem is that an RDBMS does not efficiently support parallelized query execution, and generally does not even run on cluster-based deployments. Big data solutions, on the other hand, fundamentally support cluster-based data federation: a query is executed in parallel on multiple nodes, and the search time is drastically reduced.
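The effect of parallel execution is easy to demonstrate even on one machine: partition the records, scan each partition in a separate worker, and collate the partial results, which is essentially what cluster-based federation does across nodes. A toy sketch with made-up records:

```python
from multiprocessing import Pool

def scan_partition(partition):
    # Each worker filters its own shard; on a real cluster this would run
    # on the node that physically holds the shard.
    return [r for r in partition if r["temperature"] > 30]

if __name__ == "__main__":
    records = [{"sensor": "sensor1", "temperature": t % 50}
               for t in range(1_000_000)]
    partitions = [records[i::4] for i in range(4)]  # 4-way "federation"
    with Pool(4) as pool:
        partial = pool.map(scan_partition, partitions)
    hot = [r for part in partial for r in part]     # collate the results
    print(len(hot), "readings above 30 degrees")
```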
Basically, you need to answer the following questions before choosing the right database solution:
- How do you define the data (schema)? Can it support unstructured data?
- What is the limit of data storage? How does the data scale? Can I add more disks and distribute the data?
- How easy is it to change the schema of the data, given that I might add a new sensor producing a different kind of data?
- How do you query the data with accuracy and efficiency (the query model)? Without the ability to query the data, the store is no better than file folders. The time taken to query also matters. Some interesting queries might be:
- Query all the plate numbers of cars that were speeding above 80 km/h between 20-09-2013 and 24-09-2013 in the Connaught Place area of New Delhi.
  - Here we are interested in the traffic surveillance records coming from cameras and speed sensors. Image processing will be involved, since we are interested in the number plates.
- Query the flight path of flight number 123 on 20-10-2008.
  - Since the data is 6 years old, the search can take some time. This is a challenge that a big data solution must resolve before it can even be considered a potential solution.
- Query the 10 mobile phone locations nearest to the GPS position (22,77) (see the sketch after this list).
  - This needs a lot of distance calculations among thousands of mobile phones in that region, yet it must return the result in a few seconds.
- Find my Facebook friends.
  - The query looks simple, but yours could be one of 200 million names in the database. An RDBMS query works sequentially: at any speed it would take minutes to hours just to find your name, and the joins after that would bring down the Facebook servers.
- Find out whether a person has already completed the scans for an Aadhaar card.
  - This has to match iris scans and fingerprint scans. Even with 1 crore (10 million) people registered, imagine the computing power required; surprisingly, the matches are verified in less than a minute.
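The nearest-phones query above, for instance, maps directly onto MongoDB's geospatial support. A sketch, again assuming a local MongoDB and hypothetical collection and field names; note that GeoJSON orders coordinates as longitude, latitude:

```python
from pymongo import MongoClient

phones = MongoClient("mongodb://localhost:27017").iot.phones
phones.create_index([("loc", "2dsphere")])  # spatial index required by $near

# A couple of sample phone positions (GeoJSON coordinate order is lon, lat):
phones.insert_many([
    {"phone": "phone-1", "loc": {"type": "Point", "coordinates": [77.01, 22.02]}},
    {"phone": "phone-2", "loc": {"type": "Point", "coordinates": [77.50, 22.40]}},
])

# The 10 phones nearest to GPS position (lat 22, lon 77), nearest first:
nearest = phones.find({
    "loc": {"$near": {"$geometry": {"type": "Point", "coordinates": [77, 22]}}}
}).limit(10)
for p in nearest:
    print(p["phone"], p["loc"]["coordinates"])
```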
Some of the popular NoSQL databases are MongoDB (http://www.mongodb.com/nosql), CouchDB (http://couchdb.apache.org/), Cassandra (http://cassandra.apache.org/), and HBase (http://hbase.apache.org/).
Some good references to understand big data solutions are:
1. NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, Pramod J. Sadalage, Martin Fowler, http://www.amazon.in/NoSQL-Distilled-Emerging-Polyglot-Persistence/dp/0321826620
2. A Survey on Cloud Database, Deka Ganesh Chandra
3. A comparison between several NoSQL databases with comments and notes, Bogdan George Tudorica, Cristian Bucur
4. Cache and Consistency in NOSQL, Peng Xiang, Ruichun Hou, Zhiming Zhou
5. A Storage Infrastructure for Heterogeneous and Multimedia Data in the Internet of Things, Mario Di Francesco, Na Li, Mayank Raj, Sajal K. Das, 2012 IEEE International Conference on Green Computing and Communications, Conference on Internet of Things, and Conference on Cyber, Physical and Social Computing
6. A Storage Solution for Massive IoT Data Based on NoSQL, Tingli Li, Yang Liu, Ye Tian, Shuo Shen, Wei Mao, 2012 IEEE International Conference on Green Computing and Communications, Conference on Internet of Things, and Conference on Cyber, Physical and Social Computing
7. Social-Network-Sourced Big Data Analytics, Wei Tan, M. Brian Blake, Iman Saleh, Schahram Dustdar, IEEE Internet Computing, 2013
Analytic Engine:
All the use cases of IOT indicate that analysis of the collected data is the key to making intelligent decisions and providing optimizations. We have the data in the big data store, but it is not of much use unless we can analyze it and find patterns from which to draw conclusions. And given that the amount of data can run into petabytes, distributed computing is inevitable.
For example, let's say that from the web server logs stored in the big data store we need to find out which keywords have been searched the most. The logs can run into petabytes (imagine the logs Google collects about user search queries). A naive approach to analyzing them is to parse each log line with regular expressions to capture the keywords, then count them and create aggregations.
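The naive single-node version is just a regular expression and a counter. A sketch; the log file name, the log format, and the regex are all made up for illustration:

```python
import re
from collections import Counter

# Hypothetical log format: the search keyword appears as /search?q=<keyword>
QUERY_RE = re.compile(r"/search\?q=(\w+)")

counts = Counter()
with open("access.log") as log:  # assumed name of a web server log file
    for line in log:
        match = QUERY_RE.search(line)
        if match:
            counts[match.group(1)] += 1

print(counts.most_common(10))  # the ten most searched keywords
```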
This is quite time consuming and would take a huge amount of time on a single node. It is a classic case for distributed computing: the data can be distributed across different nodes, the analysis run on each smaller set, and the results collated together at the end.
These kinds of problems fall under a paradigm called map-reduce. To ease the work of users, there are frameworks available that help them write map-reduce jobs and get their analysis done.
The Hadoop framework is one of the most popular map-reduce frameworks, providing the capability to process large data in a highly parallel fashion. It can read data from an existing database source, store it in its own filesystem called HDFS, and then process it using a map-reduce algorithm. The map-reduce job basically defines what needs to be done to the data: for example, the user may want to run a regex on each line of a log file, or an image processing algorithm on each image stored in the database. Map-reduce makes sure that the desired algorithm is applied to each unit of data in a distributed fashion.
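With Hadoop Streaming, for example, the keyword count above can be expressed as two small scripts: a mapper that emits keyword/count pairs and a reducer that sums them, while Hadoop handles distributing the log shards and shuffling the pairs between the two stages. A sketch in the classic streaming style, reusing the same hypothetical regex as above:

```python
# mapper.py - reads raw log lines on stdin, emits "keyword<TAB>1" pairs
import re
import sys

QUERY_RE = re.compile(r"/search\?q=(\w+)")
for line in sys.stdin:
    match = QUERY_RE.search(line)
    if match:
        print(f"{match.group(1)}\t1")
```

Hadoop sorts the mapper output by key before it reaches the reducer, so the reducer only has to sum runs of equal keywords:

```python
# reducer.py - input arrives sorted by keyword; sum each run of equal keys
import sys

current, total = None, 0
for line in sys.stdin:
    keyword, count = line.rstrip("\n").split("\t")
    if keyword != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = keyword, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")
```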
Some big data solutions like MongoDB, CouchDB, and Cassandra have inbuilt map-reduce functionality, while others require a framework like Hadoop for processing.
The details of map-reduce and Hadoop are outside the scope of this article. There are good references and books available to get started:
1. Hadoop in Action, Chuck Lam, Manning Publications, 2011
2. MapReduce Design Patterns, Donald Miner & Adam Shook, O'Reilly Publications, 2012
3. Hadoop: The Definitive Guide, Tom White, O'Reilly Publications, 3rd edition, 2012
Integration with the cloud:
Cloud integration is a fundamental requirement for IOT based frameworks. The cloud provides three main advantages in this regard:
1. Storage: With storage requirements running into petabytes, it is not feasible for most companies to buy their own storage. The cloud provides a feasible solution, offering storage of any order in an affordable fashion.
2. Computing: As we saw, the analysis can be heavyweight, given the amount of data to analyze and the algorithm to run on each unit of data. A single node cannot run the analytics, yet building an in-house cluster is too expensive for most organizations. The cloud provides on-demand computing power in the form of multiple nodes over which the computation can be distributed.
3. SOA: The data and the analytics might be required by multiple modules and devices. The cloud provides an easy way to create services that expose the data (stored in the big data database) and the analytical results through REST calls.
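A minimal sketch of such a REST service using Flask. The endpoint and field names are hypothetical, and an in-memory dictionary stands in for the big data store a real deployment would sit in front of:

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Stand-in for analytics results that would come from the big data store.
ANALYTICS = {"sensor1": {"avg_temperature": 10.5, "readings": 86400}}

@app.route("/sensors/<sensor_id>/analytics")
def sensor_analytics(sensor_id):
    result = ANALYTICS.get(sensor_id)
    if result is None:
        return jsonify(error="unknown sensor"), 404
    return jsonify(result)  # other modules and devices consume this via REST

if __name__ == "__main__":
    app.run()
```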
I have tried to give a high-level overview of what IOT is, how it is useful in improving processes and saving operational costs, and the technological landscape that needs to be understood before creating a useful IOT based solution. I would strongly recommend reading the references included in the various sections to get an in-depth understanding of the topic.