1. Introduction
Big data is often treated as just another buzzword that lends a cool tag to our products. So, almost inadvertently, we adopt some of the popular technologies to make sure our application performs some kind of big data analysis, whether or not it is done correctly or for any intended use. I remember the same thing happening when Web 2.0 took off: every company wanted to jump on the bandwagon without knowing the real meaning of the technology. Using NoSQL or Hadoop is not, by itself, big data. We need to understand the use cases of big data, its advantages and its pitfalls before we can extract any real value from it.
There are numerous references available on this topic, most of which talk about the 6 Vs of big data (Volume, Variety, Velocity, Variance, Veracity, Value) and the challenges they pose, and then delve into NoSQL, MapReduce and related issues. That treatment is quite theoretical and hard to appreciate for someone who is just getting introduced to the topic.
I came across a book [8] last month that covers big data concepts in a more pragmatic way, using lots of case studies and real-world examples. It is one of the very few books on the market that talks mostly about the value proposition big data brings to the table rather than the nitty-gritty of the technologies. You start to understand why big data can be a game changer in the near future. Cloud, NoSQL, MapReduce and Hadoop are all enablers, but you need to know why and when to use them.
In this article I summarize the key concepts the book talks about and the various case studies it uses to drive its points across. I have also added my own inputs wherever I felt the description needed more explanation or I had some extra information. Wherever I write "the authors", I am referring to a section of the book; for the additional information I have provided the related references at the end.
According to the authors, there is no rigorous definition of big data. In general, if the volume of data to be analyzed is so large that it cannot fit in the memory of the computers used for general processing and cannot be analyzed using conventional analytical tools, we can call it a big data problem.
The authors begin the book by giving us a brief tour of reality. In today's world we cannot employ the traditional methods of collecting data and performing analysis, simply because the data is huge and noisy (messy). It might not always be possible to calculate an exact value because of this noise, but trends and predictions are the next best thing we can strive for. In this article I will argue that predictions and correlations are often more valuable than exact values. At its core, big data is about probabilistic analysis that produces predictions.
In this article I will first discuss the challenges we face when dealing with big data analysis. I will then discuss how data, if collected in the right form (datafication), can be analyzed and used in ways that were unimaginable some time back. Next, I will discuss how big data affects the valuation of a business. Finally, I will discuss the risks that big data analysis poses in its current state.
2. Challenges
2.1 Challenge #1: Lots of data is available to analyze
Today we have the ability to collect, store and analyze vast amounts of data. We are collecting data from sensors, cameras, phones, computers and basically anything that can connect to the internet. We have so much data to work with that we no longer need to limit our analysis to a sample. We can gather everything possible and then make our predictions; the conventional approach of taking samples is no longer required, because affordable technology lets us work with the entire data set.
Sampling is an efficient way to perform analysis, but the accuracy of the results depends largely on the sample set. If the sample is not random, the results can be biased, and it is difficult to find trends in a sample because the data is discrete rather than continuous. With a large amount of data, predictions become more accurate. Previously we did not have the technical ability to work with all the data, so we tended to take samples; today, with affordable storage and massive computing power, we can use a sample space of N=ALL, i.e. take all the data for the analysis. Often, the anomalies carry the most valuable information, but they are only visible when all of the data is examined.
- Google's flu predictions are relatively accurate because they are based on the billions of search queries Google receives per day.
- Farecast, a company that predicts whether the airfare for a seat is going to go up or down, analyzed about 225 billion flight and price records to make that prediction [8,10].
- Xoom, a firm that specializes in international money transfers, some time back uncovered fraud in certain card transactions originating from New Jersey. It did so by spotting a pattern in the transactions, which was evident only when the entire data set was analyzed; random samples would not have revealed it. Sampling can leave out the important data points, and we may never uncover such patterns.
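The advantage of N=ALL over sampling is easy to demonstrate with a toy simulation. The sketch below is my own illustration, not from the book; the transaction volume and fraud count are invented. A full scan is guaranteed to surface every rare anomaly, while a small random sample usually contains none of them.

```python
import random

# Hypothetical data: 1,000,000 transactions, of which only 20 are fraudulent.
random.seed(42)
transactions = [{"id": i, "fraud": False} for i in range(1_000_000)]
for i in random.sample(range(1_000_000), 20):
    transactions[i]["fraud"] = True

# N=ALL: scanning everything always finds every anomaly.
found_all = sum(t["fraud"] for t in transactions)

# Classic 1% random sample: the rare anomalies are mostly absent.
sample = random.sample(transactions, 10_000)
found_sample = sum(t["fraud"] for t in sample)

print(f"full scan found {found_all} frauds, 1% sample found {found_sample}")
```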
2.2 Challenge #2: Messiness of data
Highly curated data is a thing of the past, from a time when only a small amount of information was at hand and you wanted to squeeze exact results out of it in the best possible way. Now we are dealing with huge amounts of data (petabytes). Messiness (errors, noise) can creep in for many reasons:
- The simple fact that the likelihood of error increases as you add more data points.
- Inconsistent formatting of data coming from different sources.
- When collecting data from thousands of sensors, some sensors may give faulty readings, adding to the errors.
- Collecting data at a higher frequency can result in out-of-order readings due to network delays.
Cleaning is an option only with small amounts of data, where we could look for errors, formatting issues and noise and remove them before the analysis, thereby ensuring the correctness of the result. In today's world the data is so huge that it is not feasible to clean up every error, and the velocity of the data is so high that guaranteeing its cleanliness in real time is essentially impossible. The book's suggestion is that more data beats clean data.
The authors argue that even if we leave some degree of error and noise in the data, its effect will be nullified because we are using a huge number of data points. Any particular sensor reading may be incorrect, but the aggregate of many readings provides a more comprehensive picture and cancels out the erroneous ones. Instead of exactness, analysts should ask "Is it good enough?" We can give up a bit of accuracy in return for knowing the general trend.
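Here is a minimal sketch of that "the aggregate cancels the noise" argument. It is my own illustration; the true value, noise level and fault rate are invented.

```python
import random
import statistics

# Hypothetical setup: 10,000 sensors measure a quantity whose true value is 50.0.
# Most readings carry small noise; about 1% of sensors are faulty and report garbage.
random.seed(0)
TRUE_VALUE = 50.0
readings = []
for _ in range(10_000):
    if random.random() < 0.01:                        # faulty sensor
        readings.append(random.uniform(0, 200))       # wildly wrong reading
    else:
        readings.append(random.gauss(TRUE_VALUE, 2))  # noisy but honest reading

# The aggregate stays close to the truth even though no single reading is trusted.
print(f"mean of all readings:   {statistics.mean(readings):.2f} (true value {TRUE_VALUE})")
print(f"median of all readings: {statistics.median(readings):.2f}")
```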
Sometimes the messiness of data can even be turned to one's advantage, as Google's spell check has done. The program is based on all the misspellings that users type into the search window and then "correct" by clicking on the right result. With almost three billion queries a day, those results soon mount up [10].
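A hedged sketch of that idea follows; it is my own simplification and not Google's actual pipeline, and the click log is invented. The point is simply that the misspellings themselves, paired with the results users accept, become the training data.

```python
from collections import Counter, defaultdict

# Hypothetical click log: (query the user typed, result/query they accepted).
click_log = [
    ("recieve", "receive"),
    ("recieve", "receive"),
    ("recieve", "recipe"),
    ("accomodate", "accommodate"),
]

# Count how often each accepted correction follows a given misspelling.
corrections = defaultdict(Counter)
for typed, accepted in click_log:
    corrections[typed][accepted] += 1

def suggest(query: str) -> str:
    """Return the most frequently accepted correction, or the query unchanged."""
    if query in corrections:
        return corrections[query].most_common(1)[0][0]
    return query

print(suggest("recieve"))   # -> "receive"
print(suggest("bigdata"))   # -> "bigdata" (no evidence, leave as-is)
```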
Google used the same concept in its Translate service, which translates text from one language to another. The translations are based on the probability of words occurring in a particular order, i.e. how a particular word in English is generally used, learned from analyzing the huge amount of text Google has collected. Some of the source documents are written in poor language, but since the system works on probabilities rather than exact matches, that kind of messiness is effectively ignored.
Messiness aids in creating big datasets, which are ideal for performing
a probabilistic analysis.
There are several other examples where the messiness of big data was neglected in a probabilistic analysis and collecting all the available data helped improve an existing process. British Petroleum could determine the effect of crude oil on the corrosion of pipes by collecting pipe stress data from sensors over a period of time. Some of the sensors would have given wrong readings, since physical devices undergo wear and tear, but with data coming from a large number of sensors this noise had an insignificant effect on the calculations.
The inflation index used by policy makers to decide on interest rates and salary hikes is calculated by gathering price data for various commodities from different parts of the country. Doing this manually is a highly time-consuming and expensive task, and decision makers have to wait a long time for the results. PriceStats instead scrapes commodity price data from different websites and runs its algorithms on that data to calculate inflation, making the process largely automatic and extremely fast. The web-scraped data is not always correct or up to date, but with a probabilistic approach such errors can be neglected. People are usually interested in the trend rather than the exact values on a daily basis, so the messiness of the data is acceptable.
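As a rough illustration of the mechanics (my own sketch; PriceStats' real methodology is far more sophisticated, and the items and prices below are invented), a simple index can be computed as the ratio of the current basket price to a base-period basket price, while quietly skipping entries the scraper failed to fetch:

```python
# Hypothetical scraped prices: item -> (base_period_price, current_price_or_None).
# A None means the scrape failed or the page was stale; it is skipped, not cleaned.
scraped = {
    "milk":  (2.00, 2.10),
    "bread": (1.50, 1.58),
    "rice":  (0.90, None),
    "fuel":  (3.20, 3.45),
}

base_total = current_total = 0.0
for base, current in scraped.values():
    if current is None:
        continue
    base_total += base
    current_total += current

index = 100 * current_total / base_total   # base period = 100
print(f"price index: {index:.1f}")
```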
Consider "tagging" of photos on Flickr: the tags are irregular and include misspellings, but the sheer volume of tags allows us to search images in various ways and combine terms in ways that would not be possible in a precise system [1].
2.3 Challenge #3: Correlations, not exactness
With so much data to analyze and messiness that is unavoidable, the best we can strive for is to find trends. Exact answers are not feasible, and in most cases they are not required either. For example, based on machine data we can predict that an engine is most likely to fail after two months, but the exact date cannot be calculated because the data is not 100% accurate. Predictions and correlations tell you what is going to happen, with some probability, but not why it will happen. This helps businesses "foresee events before they happen" and make informed, profitable decisions. Usually the correlations are found by joining the data with one or two other data sets.
If a user watched movie A, what is the chance that he will like movie B? If a user has bought book A, will he be interested in book B? Amazon and Netflix were among the first in the recommendation space, which is based on deriving correlations between the item the user has bought and similarity scores with the other items available in the store.
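A toy version of such item-to-item scoring is sketched below. It is my own simplification, not Amazon's or Netflix's actual algorithm, and the purchase baskets are invented: items are scored simply by how often they are bought together with the item in hand.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories: each set is one customer's bought items.
baskets = [
    {"book_a", "book_b", "book_c"},
    {"book_a", "book_b"},
    {"book_a", "book_d"},
    {"book_b", "book_c"},
]

# Count co-occurrences of every item pair across all baskets.
co_counts = Counter()
for basket in baskets:
    for x, y in combinations(sorted(basket), 2):
        co_counts[(x, y)] += 1

def recommend(item: str, top_n: int = 2):
    """Items most often bought together with `item`."""
    scores = Counter()
    for (x, y), count in co_counts.items():
        if x == item:
            scores[y] += count
        elif y == item:
            scores[x] += count
    return scores.most_common(top_n)

print(recommend("book_a"))   # e.g. [('book_b', 2), ('book_c', 1)]
```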
The authors give a number of case studies:
By analyzing its sales data, Walmart found a correlation between buying habits and weather conditions: people prefer to buy certain types of products in specific weather. Walmart uses this analysis to stock its inventory with a different set of products accordingly.
Google predicted which people were infected with flu by finding patterns in their search strings; a person infected with a particular disease is more likely to search for a particular set of keywords.
FICO invented the Medication Adherence Score, which measures how well we handle a drug prescription. The number indicates our likelihood of taking drug prescriptions as they were written for us: the algorithm figures out who does or doesn't fill their prescriptions, and the more adherent we are, the higher the score. The score is based on personal parameters such as what a person buys, where he lives and what his food habits are; FICO seems to have found a correlation between a person's habits and his adherence to prescriptions. This data is valuable for health insurers, who can concentrate their follow-ups on people with lower scores and not bother the others [2].
Aviva can predict from a customer's personal data whether he suffers from a medical condition. This helps it avoid the usual medical tests for customers who are predicted to be healthy.
Retail stores like Target use female customers' buying patterns to predict whether a customer is pregnant. This helps them send coupons targeted at purchases for the different stages of pregnancy.
The authors argue that a good correlation is one where a change in one variable is accompanied by a change in the other, while a bad or not-so-useful correlation is one where a change in one variable does not affect the other variable significantly. The real challenge is to find the correct variables on which to establish the correlation.
In one of the case studies the authors discuss how New York City's public utility department can predict which manhole is going to explode, based on the age of the manhole and its prior history of explosions. Correlating a manhole's likelihood of exploding with its age and past history sounds intuitive, but arriving at that conclusion by analyzing a huge amount of heterogeneous data was a genuine big data challenge.
3. Datafication
Big data is all about the analysis of large chunks of data to come up with something useful, and data is the key input to any big data use case. Even though the data can be messy, incomplete and heterogeneous, one thing is fundamental: it should be in a form that can be read and analyzed by computers. The process of converting data into a form that can be used for analysis is what the authors call datafication (making the data quantifiable). There is a fine difference between digitization and datafication: digitization converts data into a form that can be stored on a computer or storage system, while datafication stores the data in a form that can be analyzed by algorithms.
People generally talk about digitizing every written document (scanning it and storing it on disk) so that it can be retrieved whenever needed. But that data has little value from an analysis point of view if it remains only a set of images, because the content of those documents is not in a form we can analyze: we cannot build indexes on images, so we cannot search for the text inside them. Datafication is the next step, transforming those documents, or any information, into a form that computers can analyze. Google Books (http://books.google.com/) is a classic example. Google first scanned millions of books so that they were available for online reading; that is digitization. It then applied OCR technology to extract the words from the pages and stored them in a form from which a search index could be created; that is datafication.
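A minimal sketch of that second step (my own illustration; the documents are invented): once OCR has turned scanned pages into text, an inverted index mapping each word to the documents that contain it is what makes the content searchable rather than merely stored.

```python
from collections import defaultdict

# Hypothetical OCR output: document id -> extracted text.
ocr_text = {
    "book1_p1": "whaling voyages of the nineteenth century",
    "book2_p7": "the century of steam and iron",
}

# Datafication step: build an inverted index (word -> documents containing it).
index = defaultdict(set)
for doc_id, text in ocr_text.items():
    for word in text.lower().split():
        index[word].add(doc_id)

print(sorted(index["century"]))   # -> ['book1_p1', 'book2_p7']
```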
The book gives several case studies that emphasize the importance of datafication.
Maury's work of extracting data from old sea logs to find shorter and safer sea routes is one of the first examples of large-scale datafication of content. The data in old navy logbooks was manually extracted and converted into a form that could be analyzed. Tabulated properly, together with added information such as weather conditions and tidal behavior, it helped find shorter routes that saved a great deal of time and money. This entire work was done manually and does not fall within the realm of modern big data analysis, but it shows the importance of data in unveiling unknowns.
Datafication of a person's vital signs has become easy with the advent of health bands. They collect data about the vital signs and upload it to a server where the analysis takes place, and the person gets valuable trends about his health from these inputs.
Datafication of location
The ability to track the location of a person or a thing consistently using GPS has led to various interesting use cases. In some countries the insurance premium depends on where and when you drive the car: in places where theft is high, the premium is likely to go up. You no longer have to pay a fixed annual insurance fee based primarily on age, sex and past record.
UPS uses GPS trackers in its trucks to collect data about the various routes. It can quantify things like which route took the most time, how many turns and traffic signals it involved, and which areas are prone to congestion (where the truck moved slowly). This data is used to optimize routes so that trucks take shorter paths with fewer turns and less congestion, which has saved millions of dollars in fuel cost.
The ability to collect one's location through smartphones and home routers is also turning out to be very useful. Targeted ads are shown to the user based on where he is and where he is predicted to go, and the locations of people's smartphones are used to predict traffic conditions plotted on Google Maps.
Datafication of relationships
Social interactions have always existed but were never formally documented. These relationships can take the form of friendships, professional links or general emotions.
Facebook enabled the datafication of social relationships, i.e. friendships, in the form of the social graph. This graph can give you information about your friends, friends of friends (FOAF), your posts, photos and likes: almost everything you do on Facebook. The best part is that the social graph can be machine-analyzed by algorithms to find out details about a person and his interactions. Credit card companies consult social graphs before issuing credit cards, and recruiters check Facebook profiles to judge a person's social behavior [7].
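To make "machine-analyzable" concrete, here is a tiny sketch of my own (the friendships are invented): with the graph stored as an adjacency list, friends-of-friends fall out of a simple two-hop traversal.

```python
# Hypothetical social graph as an adjacency list: person -> set of friends.
graph = {
    "alice": {"bob", "carol"},
    "bob":   {"alice", "dave"},
    "carol": {"alice", "erin"},
    "dave":  {"bob"},
    "erin":  {"carol"},
}

def friends_of_friends(person: str) -> set:
    """People two hops away who are not already direct friends."""
    direct = graph.get(person, set())
    foaf = set()
    for friend in direct:
        foaf |= graph.get(friend, set())
    return foaf - direct - {person}

print(friends_of_friends("alice"))   # -> {'dave', 'erin'} (set order may vary)
```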
F-commerce is a new business paradigm based on the observation that if a friend has bought a product, his friends are likely to buy it too.
Bing boosts the rankings of movie theaters and restaurants in its search results based on the likes and posts in the user's Facebook social graph. The prediction rests on the premise that if your friend likes a theater or a restaurant, you are likely to enjoy it too.
Twitter datafied people's sentiments. Earlier there was no effective means of collecting people's comments and opinions on a topic, but with Twitter it is very convenient. Sentiment analysis is the big data way of extracting that information from tweets.
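A hedged sketch of sentiment scoring follows. This is a deliberately naive keyword-count version of what real systems do, and the word lists and tweets are invented; it is here only to show how free-form tweets become a number that can be aggregated.

```python
# Hypothetical word lists; real sentiment analysis uses far richer models.
POSITIVE = {"great", "love", "awesome", "good"}
NEGATIVE = {"terrible", "hate", "awful", "bad"}

def sentiment(tweet: str) -> int:
    """Positive score means the tweet leans favorable, negative means unfavorable."""
    words = tweet.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

tweets = [
    "I love the new movie and its awesome soundtrack",
    "terrible plot and bad acting",
]
scores = [sentiment(t) for t in tweets]
print(scores)                        # -> [2, -2]
print(sum(scores) / len(scores))     # crude aggregate mood across many tweets
```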
Predictions of a Hollywood movie's success or failure are now guided by tweets about the movie, and media channels monitor tweets on various topics to gauge public opinion on different issues.
There has been research on performing real-time semantic analysis of tweets to detect breaking news [3], and tweets are analyzed to detect traffic jams, cases of flu, epidemics and so on [4,5].
Investment bankers use tweets to sense the public mood about a company and its policies; favorable tweets mean people are more likely to invest in the company.
LinkedIn pioneered the datafication of the professional network. It represents professional relationships as a graph in which you can see how you are connected to another colleague, get recommendations for your work from colleagues, and find out what kind of work colleagues in different companies are doing so you can make career decisions. These professional graphs are now heavily used by recruiters to learn how a person is judged by his colleagues, what his skill set is, how well connected he is, and so on, none of which was possible earlier by just scrutinizing a resume [7].
4. Value of Data
The value of data lies in its use and its potential for multiple reuses. With so much data being collected, new and previously unthinkable uses keep coming up. More than the primary use of the data, it is the secondary or tertiary uses that are most valuable. Sometimes the value can only be unleashed by combining multiple data sets: data mashups (joining two or three data sources together). Such joins reveal correlations that are quite surprising and have real business impact. It is this untapped potential that makes big data analytics so lucrative and so much in demand.
Companies like Zillow combine property data with data about local businesses to calculate a walk score, which tells potential buyers how walkable the daily stores, schools and restaurants are from a particular property.
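As a rough sketch of such a mashup (my own illustration; Zillow's actual Walk Score methodology is more sophisticated, and the addresses, coordinates and 1 km rule are invented), one data set of properties is joined against another data set of nearby amenities and scored by proximity:

```python
import math

# Two hypothetical data sets to be "mashed up".
properties = {"12 Oak St": (40.001, -75.002), "9 Elm Ave": (40.020, -75.050)}
amenities = [("grocery", 40.002, -75.003), ("school", 40.003, -75.001),
             ("cafe", 40.050, -75.090)]

def naive_walk_score(lat: float, lon: float) -> int:
    """Count amenities within roughly 1 km (crude flat-earth distance)."""
    nearby = 0
    for _name, alat, alon in amenities:
        dist_km = math.hypot(lat - alat, lon - alon) * 111  # ~111 km per degree
        if dist_km <= 1.0:
            nearby += 1
    return nearby

for address, (lat, lon) in properties.items():
    print(address, "->", naive_walk_score(lat, lon), "amenities within 1 km")
```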
Data collected for one purpose can be put to multiple other uses. The value of data needs to be considered in terms of all the possible ways it can be employed in the future, not simply how it is being used in the present.
- Google collected data for Street View, then used the GPS data to improve its mapping service and the functioning of its self-driving car.
- Google's reCAPTCHA service is another example where data collected for one purpose serves multiple uses. The reCAPTCHA images are generally snippets from Street View and Google Books that the computer could not digitize successfully; when the user enters the text, he helps Google digitize them, improving Maps, Books and Street View.
- Farecast harnessed data from previously sold tickets to predict the future price of airfares.
- Google reused search terms to uncover the prevalence of flu.
- SEO (search engine optimization) is driven primarily by search terms.
- Companies use search queries to find out what users are looking for and then strategize their products accordingly.
- Google reused the data from its books to feed its Translate service and improve the translations.
- FlyONTime.com combines weather and flight data to find out the delays in flights at a particular airport.
Sometimes data that is otherwise thought to be useless, such as typos or clicks, can be used in innovative ways; this has been termed data exhaust. There are companies that use the click patterns on a website to find UI design issues, and Google is a pioneer in using this extra data to improve its services.
A study conducted by GE suggests that service data from aircraft engines can provide insights leading to a 1% improvement in fuel efficiency, which has the potential to save about $1Bn per year. This has been referred to as the Power of 1.
Today we can collect sensor data directly from machines. The data can be continuously monitored for anomalies or patterns that indicate a particular part is about to wear out, so that service engineers can proactively visit the site and take corrective action. Rather than reactive maintenance, service engineers can provide proactive care to the engines, and servicing happens on a need basis rather than per an annual maintenance contract (AMC). This also means near-zero unplanned downtime: maintenance happens when a part is about to wear out, not after it has stopped working. The data from the engine can raise a service request automatically if the analytics finds that a part requires servicing [6,9].
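A minimal sketch of such monitoring, assuming a threshold-based rule of my own invention (this is not GE's system; the readings, window, threshold and ticketing function are all hypothetical): watch a rolling average of a wear-related reading and raise a service request once it drifts past the threshold.

```python
from collections import deque

WINDOW = 5          # readings in the rolling average (hypothetical)
THRESHOLD = 80.0    # vibration level that signals impending wear (hypothetical)

def raise_service_request(engine_id: str, value: float) -> None:
    # Placeholder for the real ticketing/service integration.
    print(f"service request for {engine_id}: rolling vibration {value:.1f}")

def monitor(engine_id: str, readings) -> None:
    window = deque(maxlen=WINDOW)
    for reading in readings:
        window.append(reading)
        avg = sum(window) / len(window)
        # A single spike is tolerated; a sustained drift triggers the request.
        if len(window) == WINDOW and avg > THRESHOLD:
            raise_service_request(engine_id, avg)
            break

# Hypothetical vibration stream: noisy but trending upwards as the part wears.
monitor("engine-42", [70, 72, 95, 74, 76, 79, 83, 86, 88, 91])
```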
A company's worth today is determined not only by its book value (physical assets) but also by the kind of data it generates and maintains, and the value of that data again depends largely on its secondary usage potential. A major reason for the aggressive valuations of Facebook, WhatsApp and Twitter is the massive amount of user data they hold.
5. Risks
The huge amount of data and the ability to analyze it for different purposes have also resulted in some negative and dangerous scenarios, which, if not handled properly, can cause panic among people. The authors highlight three main risks.
Privacy:
Google Street View can show people's houses and cars, which can give burglars an idea of a person's wealth. Google will blur the images on request, but asking for that itself gives the impression that there is something valuable to hide.
Based on the search strings coming from a particular company's employees, competitors can predict the kind of research it is performing.
Google searches your Gmail inbox to find your hotel bookings and travel dates and then sends you reminders before you travel. It also plots this data on Google Maps, so that when you search for that hotel it shows your reservation dates against it. This is a serious privacy concern, but Google sees it as a business opportunity.
Propensity: Big data predictions can be used to judge a person and punish him. A bank may deny a loan just because an algorithm predicted that the person will not be able to repay it. The US parole system uses predictions of whether a person is likely to commit a crime in the future to make parole decisions. These predictions may sometimes be correct, but in other cases they may deny the right person what he should have got.
Dictatorship of data: Over-reliance on data and predictions is also bad. Poor-quality data can lead to biased predictions, which can be wrong and end in catastrophic decisions. It now appears that America went to war in Vietnam partly because of a few high officials' obsession with data.
6. Conclusion
I have tried to summarize Viktor Mayer-Schönberger and Kenneth Cukier's book [8] to the best of my ability, bringing in some extra examples and my own understanding of the subject where I thought it would add clarity. Even though the book has received mixed reviews online, I personally feel it is a book for everyone. It is excellent for someone who wants to know what big data is without getting into the technical aspects (mostly managers, architects and principal engineers), and it also acts as a reference for someone who already works with big data but does not understand the practical value of the technology (mostly developers).
There are a few other sections of the book that discuss how an individual or a business should find their place in the big data value chain. I felt this needs a more in-depth discussion, and it may be a good idea to write another article focusing on the big data value chain.
References
1. http://www.data-realty.com/pdf/BigDataBook-TenThings.pdf
2. http://patients.about.com/od/followthemoney/f/What-Is-The-FICO-Medication-Adherence-Score.htm
3. Harshavardhan Achrekar, Avinash Gandhe, Ross Lazarus, Ssu-Hsin Yu, Benyuan Liu, "Predicting Flu Trends Using Twitter Data", IEEE Conference on Computer Communications Workshops, April 2011, pp. 702-707.
4. Jagan Sankaranarayanan, Benjamin E. Teitler, Michael D. Lieberman, Hanan Samet, Jon Sperling, "TwitterStand: News in Tweets".
5. Makoto Okazaki, Yutaka Matsuo, "Semantic Twitter: Analyzing Tweets for Real-Time Event Notification", Proceedings of the International Conference on Social Software: Recent Trends and Developments in Social Software (2008/2009), 2010, pp. 63-74.
6. http://www.ge.com/docs/chapters/Industrial_Internet.pdf
7. Andrea Capiluppi, Alexander Serebrenik, Leif Singer, "Assessing Technical Candidates on the Social Web", IEEE Software, vol. 30, no. 1, pp. 45-51, Jan.-Feb. 2013.
8. Viktor Mayer-Schönberger and Kenneth Cukier, Big Data: A Revolution That Will Transform How We Live, Work and Think, John Murray Publishers, UK, 2013.
9. http://rohitagarwal24.blogspot.in/2013/12/a-software-engineer-perspective-on-iot.html
10. Kenneth Cukier, "Data, Data Everywhere", The Economist Special Report, February 2010, pp. 1-14.