Sunday, June 19, 2011

Can the Semantic Web Overcome the Shortcomings of Keyword Search?

Most popular search engines today are based on keyword search: they look for pages that contain the keywords entered by the user and return those pages as results. Keyword-based search has certain problems, which can be understood through a few common scenarios. (I have taken a few examples from the book Social Networks and the Semantic Web by Peter Mika and added my own explanations to make the scenarios easier to understand.)
1. Who is Frank van Harmelen? (Basic keyword-based search)
Suppose I want to put this question to a search engine such as Google, and I enter the keyword Harmelen. Google has no idea of my intentions. It simply performs a keyword search and returns results about people whose name contains Harmelen, a product on Amazon named Harmelen, and so on. If we picture the exchange as a conversation, the query and the search engine's response look like this:
Q: Who is Frank van Harmelen?
A: I don’t know but there are over a million documents with the word “harmelen” on them and I found them all really fast (0.31s). Further, you can buy Harmelen at Amazon. Free Delivery on Orders Over 15.

Upon closer inspection the problem becomes clear: the word Harmelen means a number of things. It is the surname of several people (Frank van Harmelen, Mark van Harmelen, etc.). Harmelen is also a small town in the Netherlands and the site of a tragic train accident. There are also some products on Amazon with the name Harmelen. So the search engine has returned a wide range of unrelated results.

We then start adding more information to the keywords and specify the full name we are interested in: “Frank van Harmelen”.
This reduces the result set, but there are still a lot of extra results. For example, the user might be interested in Frank van Harmelen, professor at Vrije University, but the results will include web pages about any Frank van Harmelen.

The user can refine the search further by typing more keywords, “Frank van Harmelen professor of Vrije University”, which will return pages about the correct Frank van Harmelen.
If we analyze these refinements, we are simply packing more and more information into the keywords to guide the search engine toward the correct results.

There are still problems with the results. In several scenarios the information on a page is relevant to Frank van Harmelen, yet the page does not appear in the search results. There might be web pages that never mention the name Frank van Harmelen but were written by Frank, pages that refer to him only as “he”, or pages that cite one of his articles or books as FVH98.
The search engine will miss these pages, which can be quite relevant to the person querying, because it has no information that they relate to Frank van Harmelen.
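
The behaviour described in this first scenario can be reproduced with a toy keyword matcher. This is a minimal sketch, not how any real search engine is implemented: the four pages and their identifiers are invented for illustration, and matching is a plain whole-word lookup.

```python
import re

# Hypothetical mini "web": page id -> page text (invented for illustration).
PAGES = {
    "staff-page": "Frank van Harmelen is a professor at Vrije University",
    "town-page": "Harmelen is a small town in the Netherlands",
    "shop-page": "Buy Harmelen at a discount with free delivery",
    "citation-page": "As shown in FVH98, he argues that ontologies help search",
}


def tokenize(text):
    """Lower-case the text and split it into a set of words."""
    return set(re.findall(r"\w+", text.lower()))


def keyword_search(query, pages):
    """Return ids of pages whose text contains every keyword in the query."""
    keywords = tokenize(query)
    return [page_id for page_id, text in pages.items()
            if keywords <= tokenize(text)]


# One ambiguous keyword: person, town and product all match.
broad = keyword_search("Harmelen", PAGES)

# More keywords narrow the results down to the professor's page.
narrow = keyword_search("Frank van Harmelen professor", PAGES)
```

Here `broad` contains the staff, town and shop pages, while `narrow` contains only the staff page. The citation page appears in neither list, even though it is about the same person, because it only says “he” and “FVH98”.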

There are some common queries that users want to perform but that are impossible with the current search mechanism. For example, a movie buff might want to search “Give me all the movies which were directed by Steven Spielberg and in which Harrison Ford acted”.
The search engine will just return the pages where both names appear together. It cannot understand the meaning of the web pages, so it cannot tell whether a page says that the two of them worked together or merely reports on a movie party where both were present.
The only workable approach is to break the query up manually into “Find movies by Harrison Ford” and “Find movies by Steven Spielberg”, then manually compare the cast and crew in both result sets and finally come up with the result “Indiana Jones”.

2. Image search.
A user might be interested in searching for pictures of the city of Paris. If you type “Paris” into Google image search, you will get more results related to Paris Hilton than to the city. The problem with image search is more profound than with regular search, because associating photos with keywords is much more difficult than simply looking for the keywords in the text of documents. It is very easy for a human to see that the results are not correct (Paris Hilton and the city of Paris are different things), but the computer cannot do a visual verification of the results. Automatic image recognition is not yet a mature field.
The images shown simply correspond to the pages on which they were placed, and those pages contained the keyword “Paris”.

3. Find new Music that I like

This kind of query is at an even higher level of difficulty. From the perspective of automation, music retrieval is just as problematic as image search. The search engine can try to avoid the problem by not analyzing the content of the music at all and instead picking up clues about the performer or the genre from the web pages that link to the music files. But there are practical problems. Music moves very fast, so the content of the indexed pages might have changed. Search engines typically index the Web once a month, which is too slow for the fast-moving world of music releases.

The user may enter a query like “Give me all the latest sad English songs” or “give me songs which are hip hop”. In other words, we are trying to query for songs from different artists and filter them by type.

The search engine has no way to examine the content of the music or to work out what the music means.

Ideally, the user would like the search engine to fetch music-related information from the online playlist that the user maintains and, based on the type of songs in it, return the latest matching songs.

4. Tell me about the music players with a capacity of at least 4 GB

This is an e-commerce query. The user is looking for a product with certain characteristics.
The basic problem is that translating this query from natural language into the Boolean language of search engines is (almost) impossible. We could try the search “music player” “4GB”, but the search engine will not know that 4GB is the capacity of the music player, or that we are interested in all players with at least that much memory. The query will just return the pages containing the keywords “music player” and “4GB”. It cannot even decide whether an iPod or an MP3 player should count as a music player. In simple words, if a web page talks about a 4GB iPod without using the words “music player”, the page will not appear in the result set. The search engine does not know what a music player or an MP3 player is. It is just a program capable of searching for keywords across billions of pages on the internet.

Even if we somehow modify the algorithm so that search engines understand that iPods, MP3 players, and MP4 players are all kinds of music player, the task of reading the web page to find the capacity is tedious and error-prone. The search engine has to figure out where on the page the capacity is mentioned, and it has to parse the HTML to find it. Parsing HTML pages for such information cannot be generalized, simply because every web page is built differently. Some HTML pages carry presentation information inside the page, while others use a separate CSS file. Some pages generate their content dynamically, so they cannot be indexed reliably either.
In short, with the current search algorithm the query “give me all music players with a capacity of at least 4 GB” is impossible to express.
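
To see why reading the capacity off a page is error-prone, consider a naive extractor. This is a minimal sketch under stated assumptions: the product blurbs are invented, the regular expression is just one plausible heuristic of my own, and real pages vary far more than these four strings.

```python
import re

# Hypothetical product blurbs (invented for illustration).
BLURBS = [
    "iPod, 8 GB, black",
    "MP3 player with 4GB of storage",
    "Holds up to 1,000 songs on its four gigabyte drive",
    "Music player bundle: 2 GB card plus a 16 GB upgrade coupon",
]

# Heuristic: a number followed by "GB", with optional whitespace in between.
CAPACITY = re.compile(r"(\d+)\s*GB", re.IGNORECASE)


def guess_capacity_gb(text):
    """Return the first '<number> GB' found in the text, or None."""
    match = CAPACITY.search(text)
    return int(match.group(1)) if match else None


guesses = [guess_capacity_gb(blurb) for blurb in BLURBS]

# Keep only the blurbs whose guessed capacity is at least 4 GB.
at_least_4gb = [blurb for blurb, gb in zip(BLURBS, guesses)
                if gb is not None and gb >= 4]
```

The heuristic handles the first two blurbs, finds nothing in the third (the capacity is spelled out in words), and for the fourth it simply picks the first of two numbers with no way of knowing which one describes the player. Every new page layout or phrasing would need yet another special case.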

5. Tell me where I can find the cheapest iPad 16GB

This query is a little different from the previous one, since here the user has said specifically that he is looking for a 16 GB iPad. So the search engine can go to Amazon, Google Froogle, eBay, etc. and find the specific product. But the catch is the word “cheapest”.
The user wants only the lowest price, so the search engine again has to parse the price information. As discussed above, parsing HTML pages for price information is very difficult. The price might be labelled “Total”, “Cost”, etc. Different sites can also quote the price in different currencies, so $400 may be less than 380 euros, but the search engine has no way to know this.

The scenarios above are classic examples showing that there is a lot of scope for improvement in the way search engines work.
In all of them we are dealing with a knowledge gap: what a computer understands and is able to work with is much more limited than the knowledge of the user.
This lack of knowledge is mainly due to the technological difficulty of getting computers to understand natural language or to see the content of images and other multimedia.
If we can somehow provide background information that search engines can read, the queries above can be satisfied much more easily.

The Semantic Web is an approach that applies advanced technologies to bridge this gap between human and machine.

Without getting into the actual technologies of the Semantic Web, I will discuss what kind of background information can help the search engine return the correct results.

1. In the first kind of query the user is looking for information about a person, Frank van Harmelen. Some of the meta information (knowledge) that web pages can provide is:
“Frank van Harmelen” is a professor. He teaches at Vrije University. He has a publication FVH98. He works on the “Semantic Web”.
Pages that refer to Frank as “he” can add the meta information that “he” means Frank van Harmelen, so the search engine will not miss pages that do not refer to him by his complete name.
Also, if the search engine has some background information about the user who is typing the query (with Google/ig or My Yahoo, users can have a personalized search page), the user need not type “Frank van Harmelen Vrije university”; “Frank van Harmelen” should be sufficient.
Based on the user's profile, past web history, online bookmarks, and homepage, the search engine can infer that the user is interested in the Semantic Web and that this is why he has typed “Frank van Harmelen”. The search engine already knows that Frank van Harmelen works on the Semantic Web, so the background information has closed the knowledge gap that previously existed.

There are ways to store this information in different forms, such as RDF and OWL, which are outside the scope of this article.

Similarly, movie-based searches can be enriched by maintaining some meta information:

Indiana Jones “is-a” Movie

Harrison Ford “acted-in” Indiana Jones

Steven Spielberg “directed” Indiana Jones

So the search engine knows that Harrison Ford and Steven Spielberg are both related to Indiana Jones, which is a movie. It can return the result “Indiana Jones”. In this case the search engine can simply return the name of the movie rather than a complete web page. This is similar to typing “1USD to INR” on the Google search page, where the result is the value 45.98: Google evaluates the query expression automatically and displays the answer, much like a calculator application.
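
The statements above can be treated as subject, predicate, object triples and queried directly. The following is a minimal sketch in plain Python, not RDF or any real query language: the first three triples come from the text, while the extra triples about Star Wars and Jaws and the helper names `objects` and `movies_with` are my own additions for illustration.

```python
# Each fact is a (subject, predicate, object) triple.
TRIPLES = [
    ("Indiana Jones", "is-a", "Movie"),
    ("Harrison Ford", "acted-in", "Indiana Jones"),
    ("Steven Spielberg", "directed", "Indiana Jones"),
    # Extra facts added so the query has something to filter out.
    ("Star Wars", "is-a", "Movie"),
    ("Harrison Ford", "acted-in", "Star Wars"),
    ("Jaws", "is-a", "Movie"),
    ("Steven Spielberg", "directed", "Jaws"),
]


def objects(subject, predicate):
    """Return every object linked to the subject by the predicate."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def movies_with(actor, director):
    """Return movies in which the actor acted and which the director directed."""
    movies = {s for s, p, o in TRIPLES if p == "is-a" and o == "Movie"}
    return movies & objects(actor, "acted-in") & objects(director, "directed")


result = movies_with("Harrison Ford", "Steven Spielberg")
```

With these facts `result` is the set containing only “Indiana Jones”. The answer is a name derived from the relationships, not a list of pages where both people happen to be mentioned.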

2. Image searches can be significantly improved by adding tags that describe the images. Flickr actively supports tagging. Other sites that allow image uploads can similarly ask users to add tags when they upload. The search engines can read these tags to understand the images. Images that are associated with a place, city, or country can be geocoded, so the user can also query for images by clicking on a particular location on a map and asking for images of that place. The search engine can look at the geocodes of the indexed images and return the matching ones in the results.
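
Tag and geocode lookups of this kind could look roughly as follows. This is a minimal sketch: the image file names and tags are invented, and the "nearby" test is a crude box measured in degrees of latitude and longitude that I chose for brevity, not a proper distance calculation.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Image:
    url: str                      # hypothetical file name
    tags: FrozenSet[str]          # tags supplied by the uploader
    lat: Optional[float] = None   # geocode, if the image has one
    lon: Optional[float] = None


IMAGES = [
    Image("eiffel.jpg", frozenset({"paris", "city", "eiffel tower"}), 48.858, 2.294),
    Image("louvre.jpg", frozenset({"paris", "city", "museum"}), 48.861, 2.336),
    Image("hilton.jpg", frozenset({"paris", "celebrity"})),
    Image("canal.jpg", frozenset({"amsterdam", "city"}), 52.37, 4.90),
]


def search_by_tags(images, required):
    """Return urls of images that carry every required tag."""
    required = set(required)
    return [img.url for img in images if required <= img.tags]


def search_near(images, lat, lon, max_deg=0.2):
    """Return urls of geocoded images within max_deg degrees of the point."""
    return [img.url for img in images
            if img.lat is not None and img.lon is not None
            and abs(img.lat - lat) <= max_deg
            and abs(img.lon - lon) <= max_deg]


paris_city = search_by_tags(IMAGES, {"paris", "city"})

# Simulate a click on the centre of Paris on a map.
clicked_on_map = search_near(IMAGES, 48.8566, 2.3522)
```

Both queries return the two pictures of the city and leave out the celebrity photo, which carries the tag “paris” but neither the tag “city” nor a geocode.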
3. Media searches can likewise be improved by adding meta information about the music.
The meta information will cover the title of the music, the artists, the type of music (sad, hip hop), the release date, etc. It will allow users to make searches like
“Give me latest sad songs of XYZ artist”

If the user maintains an online playlist, or specifies music preferences on a homepage or social networking profile, the query can be made more powerful:
“Give me latest music”
Now the search engine knows both the user's taste and the information about the music, and based on these it can return the correct results.
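
Both kinds of music query can be sketched over a small catalogue of song metadata. Everything here is assumed for illustration: the songs, the artists “XYZ” and “ABC”, the `style` field, and the shape of the user profile are placeholders, not any real service's format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Song:
    title: str
    artist: str
    style: str        # type of music, e.g. "sad" or "hip hop"
    released: date


# Hypothetical catalogue built from meta information about each song.
SONGS = [
    Song("Song A", "XYZ", "sad", date(2011, 5, 2)),
    Song("Song B", "XYZ", "hip hop", date(2011, 6, 1)),
    Song("Song C", "ABC", "sad", date(2011, 6, 10)),
    Song("Song D", "XYZ", "sad", date(2010, 1, 15)),
]


def latest(songs, styles=None, artist=None, limit=2):
    """Return titles of the newest songs matching the optional filters."""
    matches = [s for s in songs
               if (styles is None or s.style in styles)
               and (artist is None or s.artist == artist)]
    matches.sort(key=lambda s: s.released, reverse=True)
    return [s.title for s in matches[:limit]]


# "Give me latest sad songs of XYZ artist": filters stated in the query.
explicit = latest(SONGS, styles={"sad"}, artist="XYZ")

# "Give me latest music": filters taken from a hypothetical user profile.
USER_PROFILE = {"preferred_styles": {"sad"}}
implicit = latest(SONGS, styles=USER_PROFILE["preferred_styles"])
```

The second query is shorter for the user because the taste information comes from the profile instead of the keywords.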

4. E-commerce queries can be improved if stores maintain information about their products in an open format that search engines can parse easily. The information can state, for example, that an iPod is of type music player, with capacity 4GB, price 300, and currency USD.
Amazon, eBay, and Google Base provide a syntax that stores (in the case of Amazon) or sellers (eBay, Google Base) can use to describe their products. Based on this information, search engines can execute the query “All music players of at least 4GB capacity”.
Since the format specifies the meaning of each field very clearly, the search engines can be taught to interpret it. Even if each store uses a different format, we can write mappings between the fields. Amazon may call a product PRODUCT while eBay calls it ITEM.
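
Such field mappings can be sketched as simple dictionaries. This is an illustration under assumptions, not the actual Amazon or eBay formats: only the PRODUCT and ITEM names come from the text, while the other field names, the listings, and the set of categories that count as music players are hypothetical.

```python
# Per-store mapping from the store's field names to a shared vocabulary.
# Only PRODUCT and ITEM come from the text; the rest are hypothetical.
FIELD_MAPS = {
    "amazon": {"PRODUCT": "name", "TYPE": "category", "CAPACITY_GB": "capacity_gb"},
    "ebay": {"ITEM": "name", "KIND": "category", "SIZE_GB": "capacity_gb"},
}

# Assumed background knowledge: categories that are kinds of music player.
MUSIC_PLAYER_KINDS = {"music player", "mp3 player", "ipod"}

# Invented listings, each in its own store's format.
LISTINGS = [
    ("amazon", {"PRODUCT": "iPod 4GB", "TYPE": "iPod", "CAPACITY_GB": 4}),
    ("amazon", {"PRODUCT": "Clip-on player 2GB", "TYPE": "MP3 player", "CAPACITY_GB": 2}),
    ("ebay", {"ITEM": "Pocket player 8GB", "KIND": "music player", "SIZE_GB": 8}),
    ("ebay", {"ITEM": "USB stick 16GB", "KIND": "flash drive", "SIZE_GB": 16}),
]


def normalize(store, record):
    """Rename a store-specific record's fields to the shared vocabulary."""
    mapping = FIELD_MAPS[store]
    return {mapping[field]: value for field, value in record.items()}


def music_players_at_least(listings, min_gb):
    """Return names of music players with at least min_gb of capacity."""
    names = []
    for store, record in listings:
        item = normalize(store, record)
        is_player = item["category"].lower() in MUSIC_PLAYER_KINDS
        if is_player and item["capacity_gb"] >= min_gb:
            names.append(item["name"])
    return names


players = music_players_at_least(LISTINGS, 4)
```

Because the capacity is a typed field and not a keyword, the 4GB iPod is found even though its listing never says “music player”, and the 16GB USB stick is excluded even though its capacity is large enough.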

5. The “cheapest iPad” query can be addressed by the same solution. The search engine can find the prices in different stores by reading the XML tags. Since the currency units are also available, conversion rates can be read dynamically and applied before the prices are compared.
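
The effect of the currency field on the “cheapest” comparison can be shown in a few lines. This is a minimal sketch: the two offers and the store names are invented, and the exchange rate is an assumed fixed value used only to make the example concrete, where a real system would look up current rates.

```python
# Assumed conversion rates into US dollars (illustrative, not live data).
RATES_TO_USD = {"USD": 1.0, "EUR": 1.4}

# Hypothetical structured offers for the same product.
OFFERS = [
    {"store": "Store A", "price": 400, "currency": "USD"},
    {"store": "Store B", "price": 380, "currency": "EUR"},
]


def price_in_usd(offer):
    """Convert an offer's price into US dollars using the assumed rates."""
    return offer["price"] * RATES_TO_USD[offer["currency"]]


# Comparing raw numbers ignores the currency and picks the wrong offer.
naive = min(OFFERS, key=lambda offer: offer["price"])

# Converting to a common currency first gives the real cheapest offer.
cheapest = min(OFFERS, key=price_in_usd)
```

Under the assumed rate, the raw comparison picks Store B because 380 is the smaller number, while the converted comparison picks Store A, whose 400 dollars is the lower price.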

I have only scratched the surface of the Semantic Web. It is a vast research field with tremendous potential to improve search capabilities. The Semantic Web will certainly change the way queries are written in the near future.
