Sunday, June 19, 2011

Can the Semantic Web overcome the shortcomings of keyword search?

Most popular search engines today are based on keyword search: they look for pages that contain the keywords entered by the user and return those pages as results. Keyword-based search has certain problems, which are easiest to understand through a few common scenarios. (I have taken a few examples from the book Social Networks and the Semantic Web by Peter Mika and added my own explanations to make the scenarios easier to understand.)
1. Who is Frank van Harmelen? (Basic keyword based search)
Suppose I put this question to a search engine like Google and enter the keyword Harmelen. Google has no idea about my intentions. It will just do a keyword search and return results about people whose name contains Harmelen, some product on Amazon whose name is Harmelen, and so on. If we imagine the exchange as a conversation, the query and the search engine’s response look like this:
Q: Who is Frank van Harmelen?
A: I don’t know, but there are over a million documents with the word “harmelen” on them and I found them all really fast (0.31s). Further, you can buy Harmelen at Amazon. Free Delivery on Orders Over 15.

Upon closer inspection the problem becomes clear: the word Harmelen means a number of things. It is the name of several people (Frank van Harmelen, Mark van Harmelen, etc.). Harmelen is also a small town in the Netherlands and the site of a tragic train accident. There are also some products on Amazon with the name Harmelen. So the search engine returns a very wide range of results.

We then start adding more information to the keywords and specify the full name we are interested in: “Frank van Harmelen”.
This reduces the result set, but there are still a lot of extra results. For example, the user might be interested in Frank van Harmelen, professor at Vrije University, but the results will include web pages about any Frank van Harmelen.

The user can refine the search further by typing more keywords, “Frank van Harmelen professor of Vrije University”, which will return pages about the correct Frank van Harmelen.
If we analyze these refinements, we are basically putting more and more information into the keywords to guide the search engine towards the correct results.

There are still some problems with the results. In many cases a page is relevant to Frank van Harmelen but does not show up in the search results. There might be web pages that never mention the name Frank van Harmelen but were written by Frank, pages that refer to Frank van Harmelen only as “he”, or pages that cite one of his articles or books as FVH98.
The search engine will miss these pages, which can be quite relevant to the person who is querying, because it has no information that the pages are related to Frank van Harmelen.

There are some common queries that users want to perform but that are impossible with the current search mechanism. For example, a movie buff might want to search “Give me all the movies which were directed by Steven Spielberg and starred Harrison Ford”.
The search engine will just return the pages where both names are listed together. It cannot understand the meaning of the web pages, so it cannot tell whether a page says that the two of them worked together or whether it is a report about a movie party where both were present.
The only reliable way is to break the query up manually into “Find movies by Harrison Ford” and “Find movies by Steven Spielberg”, then manually check the crew members in both result sets and finally come up with the result “Indiana Jones”.

2. Image search.
A user might be interested in searching for pictures of the city of Paris. If you type “Paris” into Google image search, you will get more results related to Paris Hilton than to the city. The problem with image search is more profound than with regular search, because associating photos with keywords is much more difficult than simply looking for the keywords in the text of a document. It is very easy for a human to see that the results are not correct (Paris Hilton and the city of Paris are different things), but the computer cannot do a visual verification of the results. Automatic image recognition is not yet a mature field.
The images that are shown basically correspond to pages where the images were placed and which contained the keyword “Paris”.

3. Find new Music that I like

These kinds of queries are at an even higher level of difficulty. From the perspective of automation, music retrieval is just as problematic as image search. The search engine can try to avoid the problem by not going into the content of the music and only picking up clues about the performer or the genre from the web pages that link to the music files. But there are some practical problems. Music moves very fast, so the content of the indexed pages might have changed. Search engines typically index the Web once a month and are therefore too slow for the fast-moving world of music releases.

The user might give a query like “Give me all the latest sad English songs” or “give me songs which are hip hop”, so we are trying to query for songs from different artists and filter them by type of song.

The search engine has no way to find out what the music contains or what it means.

Ideally, the user would like the search engine to fetch music-related information from the online playlist that the user maintains and, based on the type of songs in it, return the latest songs.

4. Tell me about the music players with a capacity of at least 4 GB

This is basically an e-commerce query. The user is looking for a product with certain characteristics.
The basic problem is that translating this query from natural language to the Boolean language of search engines is (almost) impossible. We could try the search “music player” “4GB”, but it is clear that the search engine will not know that 4GB is the capacity of the music player and that we are interested in all players with at least that much memory. The query will just return the pages containing the keywords “music player” and 4GB. It cannot even decide whether an iPod or an mp3 player should be considered a music player. In simple words, if a web page is talking about a 4GB iPod, the page will not appear in the result set. The search engine doesn’t know what a music player or an mp3 player is. It is just a program capable of searching for keywords in billions of pages across the internet.

Even if we somehow modify the algorithm to make the search engine understand that iPods, mp3 players and mp4 players are all kinds of music player, the task of reading the web page to find out the capacity is quite tedious and error-prone. The search engine has to figure out where the information about capacity is mentioned, which means parsing the HTML page. Parsing HTML pages to find information cannot be generalized, simply because every web page is built differently. Some HTML pages have presentation information inside the page, and some have a separate CSS file. Some HTML pages generate the information dynamically, so indexing is not possible at all.
In short, with the current search algorithm the query “give me all music players with a capacity of at least 4 GB” is impossible to express.
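
To make the problem concrete, here is a minimal sketch of a pure keyword matcher. The three pages and their ids are made up for illustration, and real search engines are of course far more sophisticated than this; the sketch only shows why matching words is not the same as understanding a query.

```python
# Hypothetical mini "web" of three pages (ids and texts are made up).
PAGES = {
    "page1": "Buy the new 4GB iPod at a great price",
    "page2": "This music player review mentions a 4GB memory card sold separately",
    "page3": "Our 8GB music player holds thousands of songs",
}


def keyword_search(pages, keywords):
    """Return ids of pages whose text contains every keyword (case-insensitive)."""
    hits = []
    for page_id, text in pages.items():
        lowered = text.lower()
        if all(keyword.lower() in lowered for keyword in keywords):
            hits.append(page_id)
    return hits


print(keyword_search(PAGES, ["music player", "4GB"]))  # ['page2']
```

Only page2 comes back, although it is the least useful one. The 4GB iPod page is missed because it never says “music player”, and the 8GB player is missed because the matcher has no idea that 8GB satisfies “at least 4GB”.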

5. Tell me where I can find the cheapest iPad 16GB

This query is a little different from the one above, since here the user has said specifically that he is looking for an iPad 16GB. So the search engine can go to Amazon, Google Froogle, eBay etc. and get you the specific product. But the catch is the word “cheapest”.
The user wants only the cheapest price, so the search engine again has to parse the price-related information. As discussed above, parsing HTML pages for price information is very difficult. The price might be shown as “Total”, “Cost” etc. The various sites can also give the value in different currencies, so 400$ < 380 euros, but the search engine has no way to know this.


The above scenarios are classic examples which indicate that there is a lot of scope for improvement in the way search engines work.
In all the examples we are dealing with a knowledge gap: what a computer understands and is able to work with is much more limited than the knowledge of the user.
This lack of knowledge is mainly due to the technological difficulty of getting computers to understand natural language or to see the content of images and other multimedia.
If we can somehow provide background information that the search engines can read, the above queries can be satisfied much more easily.

The semantic web is the idea of applying advanced technologies to fill this gap between the human and the machine.

Without getting into the actual technologies of the semantic web, I will discuss what kind of background information can help the search engine return the correct results.

1. In the first kind of query the user is basically looking for information about a person, Frank van Harmelen. Some of the meta information (knowledge) that web pages can provide is:
“Frank van Harmelen” is a professor. He teaches at Vrije University. He has the publication FVH98. He works on “Semantic Web”.
Pages that refer to Frank only as “he” can add the meta information that “he” means Frank van Harmelen. The search engine then won’t miss the pages that do not refer to him by his complete name.
Also, if the search engine has some background information about the user who is typing the query (with Google/ig or My Yahoo, users can have a personalized search page), the user need not type “Frank van Harmelen Vrije university”; “Frank Van Harmelen” should be sufficient.
Based on the user’s profile, past web history, online bookmarks and homepage, the search engine can infer that the user is interested in the Semantic Web and that this is why he has typed “Frank Van Harmelen”. The search engine already knows that Frank van Harmelen works on the Semantic Web, so the background information has closed the knowledge gap that previously existed.

There are different formats for storing this information, such as RDF and OWL, which are out of the scope of this article.

Similarly, movie-based searches can be enriched by maintaining some meta information:

Indiana Jones “is-a” Movie

Harrison Ford “acted-in” Indiana Jones

Steven Spielberg “directed” Indiana Jones


So the search engine knows that Harrison Ford and Steven Spielberg are both related to Indiana Jones, which is a movie, and it can return the result “Indiana Jones”. In this case the search engine can simply return the name of the movie rather than a complete web page. This is similar to typing “1USD to INR” on the Google search page, where the result is the value 45.98. Google evaluates the query expression automatically and displays the value, a bit like a calculator application.
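
As a rough sketch of how such statements could be queried, the three facts above can be held as (subject, predicate, object) triples in a plain Python list. This is only an illustration of the idea, not RDF or any real semantic web toolkit. The function names are my own, and the Blade Runner and Jaws facts are added here just so the query has something to filter out.

```python
# Facts as (subject, predicate, object) triples.
TRIPLES = [
    ("Indiana Jones", "is-a", "Movie"),
    ("Harrison Ford", "acted-in", "Indiana Jones"),
    ("Steven Spielberg", "directed", "Indiana Jones"),
    # Extra facts, not from the text above, so the query has something to exclude.
    ("Blade Runner", "is-a", "Movie"),
    ("Harrison Ford", "acted-in", "Blade Runner"),
    ("Jaws", "is-a", "Movie"),
    ("Steven Spielberg", "directed", "Jaws"),
]


def objects(subject, predicate):
    """All objects o for which the triple (subject, predicate, o) is stored."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def movies_by(director, actor):
    """Movies directed by `director` in which `actor` acted."""
    movies = {s for s, p, o in TRIPLES if p == "is-a" and o == "Movie"}
    return sorted(movies & objects(director, "directed") & objects(actor, "acted-in"))


print(movies_by("Steven Spielberg", "Harrison Ford"))  # ['Indiana Jones']
```

The manual steps from the first part of the article (find the movies of each person, then compare the two lists) become a simple set intersection once the relationships are stated explicitly.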

2. Image searches can be significantly improved by adding tags that describe the images. Flickr actively supports tagging. Similarly, other sites which allow images to be uploaded can ask the users to add tags at upload time. The search engines can read these tags to understand the images. Images that are associated with a place, city or country can be geo-coded, so the user can also query for images by clicking on a particular location on the map and asking for images of that place. The search engine can look at the geocodes of the indexed images and return the matching ones in the result.

3. Media searches can again be improved by adding meta information about the music.
The meta information will cover the title of the music, the artists, the type of music (sad, hip hop), the release date etc. This meta information will allow users to make searches like
“Give me latest sad songs of XYZ artist”

If the user maintains an online playlist, or specifies music preferences on a homepage or social networking profile, the query can be made even more powerful:
“Give me latest music”
Now the search engine knows both the user’s taste and the information about the music, and based on this it can return the correct results.
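
A minimal sketch of such a metadata search is shown below. The song titles, artists, dates and the shape of the user profile are all placeholders invented for the example; the point is only that once mood, artist and release date are available as fields, both queries above become a filter followed by a sort.

```python
from datetime import date

# Hypothetical song metadata (titles, artists and dates are placeholders).
SONGS = [
    {"title": "Song A", "artist": "XYZ", "mood": "sad", "released": date(2011, 5, 1)},
    {"title": "Song B", "artist": "XYZ", "mood": "hip hop", "released": date(2011, 6, 1)},
    {"title": "Song C", "artist": "XYZ", "mood": "sad", "released": date(2010, 1, 15)},
    {"title": "Song D", "artist": "ABC", "mood": "sad", "released": date(2011, 6, 10)},
]

# Assumed user profile, e.g. derived from an online playlist.
USER_PROFILE = {"mood": "sad"}


def latest_songs(songs, mood=None, artist=None, limit=2):
    """Newest songs first, optionally filtered by mood and artist."""
    matches = [s for s in songs
               if (mood is None or s["mood"] == mood)
               and (artist is None or s["artist"] == artist)]
    matches.sort(key=lambda s: s["released"], reverse=True)
    return [s["title"] for s in matches[:limit]]


# "Give me latest sad songs of XYZ artist"
print(latest_songs(SONGS, mood="sad", artist="XYZ"))   # ['Song A', 'Song C']
# "Give me latest music", with the taste filled in from the profile
print(latest_songs(SONGS, mood=USER_PROFILE["mood"]))  # ['Song D', 'Song A']
```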

4. E-commerce queries can be improved if the stores maintain the information about their products in an open format which can be easily parsed by the search engines. The information can say that the iPod is of type music player, capacity 4GB, price 300, currency USD.
These days Amazon, eBay and Google Base provide a syntax that can be used by the stores (in the case of Amazon) or the sellers (eBay, Google Base) to describe their products. Based on this information the search engines can execute the query “All music players of at least 4GB capacity”.
Since the format specifies the meaning of each field very clearly, the search engines can be taught about it. Even if each store uses a different format, we can write mappings between the fields. Amazon may call a product PRODUCT while eBay calls it ITEM.
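
Here is a small sketch of the field-mapping idea. PRODUCT and ITEM are the field names mentioned above; every other field name, and every product except the 4GB iPod priced at 300 USD, is an assumption made up for the example and not the actual format of any store.

```python
# Hypothetical product feeds. Only PRODUCT and ITEM come from the text;
# the other field names and the extra products are assumed for illustration.
AMAZON_FEED = [
    {"PRODUCT": "iPod", "TYPE": "music player", "CAPACITY_GB": 4, "PRICE": 300, "CURRENCY": "USD"},
    {"PRODUCT": "Player A", "TYPE": "music player", "CAPACITY_GB": 2, "PRICE": 80, "CURRENCY": "USD"},
]
EBAY_FEED = [
    {"ITEM": "Player B", "CATEGORY": "music player", "SIZE_GB": 8, "COST": 150, "CUR": "EUR"},
    {"ITEM": "Camera C", "CATEGORY": "camera", "SIZE_GB": 16, "COST": 200, "CUR": "EUR"},
]

# Mapping from each store's field names to one shared vocabulary.
FIELD_MAPS = {
    "amazon": {"PRODUCT": "name", "TYPE": "type", "CAPACITY_GB": "capacity_gb",
               "PRICE": "price", "CURRENCY": "currency"},
    "ebay": {"ITEM": "name", "CATEGORY": "type", "SIZE_GB": "capacity_gb",
             "COST": "price", "CUR": "currency"},
}


def normalize(record, mapping):
    """Rename a store-specific record's fields to the shared vocabulary."""
    return {mapping[field]: value for field, value in record.items()}


def all_products():
    products = [normalize(r, FIELD_MAPS["amazon"]) for r in AMAZON_FEED]
    products += [normalize(r, FIELD_MAPS["ebay"]) for r in EBAY_FEED]
    return products


def music_players_with_capacity(min_gb):
    """Names of all music players with at least `min_gb` of capacity."""
    return [p["name"] for p in all_products()
            if p["type"] == "music player" and p["capacity_gb"] >= min_gb]


print(music_players_with_capacity(4))  # ['iPod', 'Player B']
```

Because the capacity is a number in a known field, “at least 4GB” is just a comparison, and the iPod is found without the page ever containing the words “music player” next to “4GB”.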

5. The “cheapest iPad” query can be addressed by the above solution. The search engine can find out the prices in different stores by reading the XML tags. Since the currency units are also available, conversion rates can be applied (read dynamically from a site such as xe.com) to compare prices in a single currency.
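
A sketch of the price comparison, assuming each store publishes a price and a currency. The two offers echo the 400$ versus 380 euros example from earlier, the store names are placeholders, and the exchange rate is a hard-coded assumption; a real system would look up current rates instead.

```python
# Assumed exchange rates to USD (hard-coded for the sketch, not live data).
RATES_TO_USD = {"USD": 1.0, "EUR": 1.4}

# Hypothetical offers for the same product from two stores.
OFFERS = [
    {"store": "Store A", "price": 400, "currency": "USD"},
    {"store": "Store B", "price": 380, "currency": "EUR"},
]


def price_in_usd(offer):
    """Convert an offer's price to USD using the assumed rate table."""
    return offer["price"] * RATES_TO_USD[offer["currency"]]


def cheapest(offers):
    """The offer with the lowest price after converting everything to USD."""
    return min(offers, key=price_in_usd)


print(cheapest(OFFERS)["store"])  # Store A
```

Comparing the raw numbers would pick Store B (380 is less than 400). Only with the currency attached to each price does the comparison come out right.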

I have just scratched the surface of the semantic web. It’s a vast research field with tremendous potential to improve search capabilities, and I expect it to change the way queries are written in the near future.

SMS based internet for social networking

I have often heard that people invent things when they face problems. A few days back I went through the same situation myself. Here is what gave me the urge to do something that would help me use my time while commuting. Every day I travel to work in a BMTC Volvo. It is torture to travel for almost an hour standing in a crammed space with 10 people pushing you every time the driver hits the brakes. Reading books is out of the question in this situation. The next best thing is to check my Facebook status or personal mail. But I don’t like taking out a flashy phone in these conditions because the chances of damage are high. Every day I wondered how I could use my old phone and still manage some of the common tasks, like status updates, checking mail or tweeting some funny observation, while standing in the bus. I finally decided to apply technology to improve my life.
A few days back I had read an article about the text-based web. The text-based web is basically about obtaining information from the web by sending an SMS to a particular service. There are a few SMS platforms, like http://www.google.com/mobile/sms/ and http://www.txtweb.com, which fetch information from the internet and send it to mobile phones via SMS.

I thought of exploring this concept and trying to develop my own applications that can receive an SMS and then perform some operation.
As you probably know, the information on a website is fetched using HTTP requests. When the user types a URL to fetch a page, he is basically sending an HTTP request to an application server, which returns the HTTP response for that request. The following steps briefly summarize that (the details of HTTP are easily available on the net, so I am not going to discuss them):

1. You enter a web page address in your browser’s location bar.
2. Your browser breaks apart that address and sends the name of the page to the web server. For example, http://www.ndtv.com/index.html would request the page index.html from www.ndtv.com.
3. A program on the web server, called the web server process, takes the request for index.html and looks for this specific file.
4. The web server reads the index.html file from the web server’s hard drive.
5. The web server returns the contents of index.html to your browser.
6. Your web browser uses the HTML markup that was returned from the server to build the rendition of the web page on your computer screen.
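
Step 2 above can be sketched with Python’s standard urllib.parse module. The helper name split_address is my own, and the request text below is a simplified illustration of what a browser sends, not a complete set of HTTP headers. Nothing is actually sent over the network here.

```python
from urllib.parse import urlsplit


def split_address(url):
    """Break an address into the server name and the page requested from it."""
    parts = urlsplit(url)
    return parts.hostname, parts.path or "/"


host, page = split_address("http://www.ndtv.com/index.html")
print(host)  # www.ndtv.com
print(page)  # /index.html

# A simplified form of the request text a browser would send to that server.
request_text = f"GET {page} HTTP/1.1\r\nHost: {host}\r\n\r\n"
print(request_text)
```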

Just to put everything in perspective, Facebook, Twitter and Gmail all have some application server that listens for your requests and responds with the information.

Platforms like http://www.google.com/mobile/sms/, http://www.txtweb.com and http://www.textmarks.com/ can convert an SMS to an HTTP request and convert the HTTP response back to an SMS. This is important, since it means you can now communicate with an application server by sending an SMS.


The concept is described in the following diagram



Step 1,2. Using the mobile phone, send an SMS to the SMS platform via the mobile carrier.
Step 3,4. The SMS platform converts this SMS to an HTTP request and sends it to the application server via the internet.
Step 5,6. The application server acts on the request, creates an HTTP response and sends it back to the SMS platform via the internet.
Step 7. The SMS platform converts the HTTP response back to an SMS and sends it to the mobile phone via the mobile carrier.

This is different from how normal apps on smartphones work. They can send and receive HTTP requests and responses directly, without any conversion, so the application communicates almost like a website.

With the above concepts in mind I decided to develop my first application. Since the whole idea was born in a BMTC Volvo, I decided to create a textApp (named along the lines of mobile apps) to find the routes of Volvo buses. As a user I might like to find out the route of “volvo 500K”.
The building blocks of my application are
1. SMS platform
2. Application server hosted on a public domain
3. Application to find the bus routes
For the SMS platform I used txtweb (http://www.txtweb.com). It is very simple to use and configure.
The following diagram explains how it works



All you need to do is choose a keyword and associate it with a URL that understands this keyword. The URL is the address of your application that will find the routes. The keyword, along with its parameters, needs to be sent to 924334200.
For example:
@bmtcvolvo 500k
And I get this response:
ROUTE:V500K
Vijayanagar Bus Station=>Vijay Nagar Maruthi Mandir=>R P C Layout=>BHEL Factory=>Veerabhadra Nagar=>Dwaraka Nagar=>Hoskerehalli=>Kamakya (Depot 13)=>Banashankari BDA Complex=>Banashankari Bus Stand=>Jayanagar 5th Block East=>Ragigudda=>BTM Mico Layout=>BTM 16th Main=>Central Silk Board (ORR)=>HSR 14th Main=>Agara=>Jn of Sarjapura Road=>ECO Space (RMZ)=>New Horizion College (ORR)=>J.P.Morgan=>Marathahalli (Mulitplex ORR)=>Spice Garden=>AECS Layout=>Kundalahalli Colony=>I Gate (Perot Systems)=>Sathya Sai Hospital=>ITPL Main Gate

The way it works is
1. You will register a keyword (@bmtcvolvo) on txtweb
2. You then associate the keyword with a URL. So @bmtcvolvo is associated with a URL, let’s say http://www.testURL.com/bmtcresponse.php
When you do that, txtweb creates a mapping table which says that if an SMS contains @bmtcvolvo, the HTTP request is to be sent to http://www.testURL.com/bmtcresponse.php

3. When you send the SMS @bmtcvolvo 500k, the txtweb platform receives it and looks up the mapping for the keyword @bmtcvolvo. It then forms a request
www.testURL.com/bmtcresponse.php?txtweb-message=500k

4. Now my application knows that the GET parameter txtweb-message is the bus number.
5. I have fetched the data from the BMTC site to construct a database of the bus routes. This table is used to find the route information for 500K.
6. The application then sends back the information in the form of an HTTP response, which basically means printing an HTML page.
Something like
".$routeNumber."
".$shortRoute."";

7. The txtweb platform will convert this information into an SMS and send it back to the phone.
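
The steps above can be sketched end to end in Python (my actual application is a PHP page, so this is an illustration rather than the real code). The keyword, the test URL and the txtweb-message parameter come from the steps above. The function names, the shape of the HTML reply, and the tiny route table holding only the first three stops of the 500K reply are assumptions for the sketch. The first function stands in for what the txtweb platform does internally, which I can only guess at.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Mapping table kept by the SMS platform: keyword -> application URL.
KEYWORD_TO_URL = {"@bmtcvolvo": "http://www.testURL.com/bmtcresponse.php"}

# Tiny stand-in for the route database (only the first three stops of 500K).
ROUTES = {
    "500K": "Vijayanagar Bus Station=>Vijay Nagar Maruthi Mandir=>R P C Layout",
}


def sms_to_request_url(sms_text):
    """Platform side (assumed behaviour): turn '@keyword params' into a request URL."""
    keyword, _, message = sms_text.strip().partition(" ")
    base_url = KEYWORD_TO_URL[keyword.lower()]
    return base_url + "?" + urlencode({"txtweb-message": message.strip()})


def handle_request(url):
    """Application side: read the bus number and answer with a small HTML page."""
    query = parse_qs(urlsplit(url).query)
    bus = query.get("txtweb-message", [""])[0].upper()
    route = ROUTES.get(bus)
    if route is None:
        body = "No route found for " + bus
    else:
        body = "ROUTE:V" + bus + "<br>" + route
    return "<html><body>" + body + "</body></html>"


url = sms_to_request_url("@bmtcvolvo 500k")
print(url)  # http://www.testURL.com/bmtcresponse.php?txtweb-message=500k
print(handle_request(url))
```

The application itself never sees an SMS. It only sees an ordinary GET request and returns an ordinary HTML page, which is why the same approach works for any service that can be reached over HTTP.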

Once I got it running, I fulfilled my goal of updating Facebook and Twitter using exactly the same approach. All my Facebook updates also carry a “via rPhone” tag, just to keep people confused into thinking that I have some cool new phone!
Hopefully my commuting time will become more enjoyable than before.