Q: Who is Frank van Harmelen?
A: I don’t know, but there are over a million documents with the word “harmelen” in them and I found them all really fast (0.31s). Further, you can buy Harmelen at Amazon. Free Delivery on Orders Over 15.
This will reduce the result set, but there are still a lot of extra results. For example, the user might be interested in Frank van Harmelen, Professor at Vrije University, but the results will include web pages about any Frank van Harmelen.
So if we analyze the refinements, we are basically putting more and more information into the keywords to guide the search engine toward the correct results.
The search engine will miss these pages, which can be quite relevant to the person who is querying, since it has no information that the pages are related to Frank van Harmelen.
The search engine will just return the pages where both of them are listed together. It cannot understand the meaning of the web pages, so it cannot figure out whether a page indicates that the two of them have worked together or merely reports on a movie party where both of them were present.
The only efficient way is to manually break the query into “Find movies by Harrison Ford” and “Find movies by Steven Spielberg”, then manually check the crew members in both sets and finally come up with the result “Indiana Jones”.
A user might be interested in searching for pictures of the city of Paris. On Google image search, if you type “Paris” you will get more results related to Paris Hilton than to Paris the city. The problem with image search is more profound than with regular search, because associating photos with keywords is much more difficult than simply looking for the keywords in the text of documents. It’s very easy for a human to see that the results are not correct (Paris Hilton and Paris the city are different things), but the computer cannot do a visual verification of the results. Automatic image recognition is currently not a mature field.
The images that were shown were basically selected because the pages on which they were placed contained the keyword “Paris”.
These kinds of queries are at an even higher level of difficulty. From the perspective of automation, music retrieval is just as problematic as image search. The search engine can try to avoid the problem by not going into the content of the music and only picking up clues about the performer or the genre from the web pages that link to the music files. But there are some practical problems. The music world moves very fast, so the content of the indexed pages might have changed. Search engines typically index the Web once a month and are therefore too slow for the fast-moving world of music releases.
The user can give a query like “Give me all the latest sad English songs” or “Give me songs which are hip hop”, so we are trying to query for songs from different artists and filter them by the type of song.
The search engine has no way to find out the content of the music or what the music means.
Ideally, a user would like the search engine to fetch music-related information from the online playlist that the user maintains and, based on the type of songs in it, return the latest songs.
So basically this is an ecommerce query. The user is looking for a product with certain characteristics.
The basic problem is that translating this query from natural language to the Boolean language of search engines is (almost) impossible. We could try the search “music player” “4GB”, but it is clear that the search engine will not know that 4GB is the capacity of the music player and that we are interested in all players with at least that much memory. The query will just return the pages containing the keywords “music player” and “4GB”. It cannot even decide whether an iPod or an mp3 player should be considered a music player. In simple words, if a web page is talking about a 4GB iPod, the page will not appear in the result set. The search engine doesn’t know what a music player or an mp3 player is. It is just a program capable of searching for keywords in billions of pages across the internet.
In short, with the current search algorithm the query “give me all music players with a capacity of at least 4 GB” is impossible to make.
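To make the limitation concrete, here is a minimal sketch of plain Boolean keyword matching. The three page texts and the function name keyword_search are invented for illustration; real search engines are far more elaborate, but the blind spot is the same.

```python
# Toy document collection; the page ids and texts are invented for illustration.
PAGES = {
    "page-a": "Review of the best music player with 4GB of storage",
    "page-b": "The new iPod ships with 8GB and plays for 24 hours",
    "page-c": "Our 4GB USB stick is not a music player",
}


def keyword_search(pages, phrases):
    """Return ids of pages whose text contains every phrase (Boolean AND)."""
    hits = []
    for page_id, text in pages.items():
        lowered = text.lower()
        if all(phrase.lower() in lowered for phrase in phrases):
            hits.append(page_id)
    return hits


results = keyword_search(PAGES, ["music player", "4GB"])
print(results)
```

In this toy run the 8GB iPod page is missed although it satisfies the intent of the query, while the USB stick page is returned only because it happens to contain both phrases.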
This query is a little different from the one above, since here the user has said specifically that he is looking for a 16 GB iPad. So the search engine can go to Amazon, Google Froogle, eBay, etc. and get you the specific product. But the catch is the word “cheapest”.
The user wants only the cheapest price, so the search engine again has to parse price-related information. As discussed above, parsing HTML pages for price information is very difficult. The price might be labelled “Total”, “Cost”, etc. Also, different sites can give the value in different currencies, so $400 can be cheaper than 380 euros, but the search engine has no way to know this.
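The currency problem can be sketched in a few lines. The exchange rate, the shop names and the field names below are all assumptions made up for the example; the point is only that comparing raw numbers gives a different answer than comparing prices whose currency is known.

```python
# Hypothetical exchange rates to US dollars, for illustration only.
RATES_TO_USD = {"USD": 1.00, "EUR": 1.30}

# Invented offers for the same product from two shops.
offers = [
    {"shop": "shop-us", "amount": 400.0, "currency": "USD"},
    {"shop": "shop-eu", "amount": 380.0, "currency": "EUR"},
]


def price_in_usd(offer):
    """Convert an offer to a common currency using the assumed rates."""
    return offer["amount"] * RATES_TO_USD[offer["currency"]]


# Comparing the bare numbers, as a keyword-based engine effectively would.
naive_cheapest = min(offers, key=lambda o: o["amount"])

# Comparing after the currency of each price has been made explicit.
true_cheapest = min(offers, key=price_in_usd)

print(naive_cheapest["shop"], true_cheapest["shop"])
```

With these assumed rates the naive comparison picks the 380 euro offer, while the currency-aware comparison picks the $400 one.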
The above scenarios are classic examples which indicate that there is a lot of scope for improvement in the way search engines work.
In all the examples we are dealing with a knowledge gap: what a computer understands and is able to work with is much more limited than the knowledge of the user.
This lack of knowledge is mainly due to the technological difficulty of getting computers to understand natural language or to see the content of images and other multimedia.
If we can somehow provide background information that the search engines can read, the above queries can be satisfied easily.
Without getting into the actual technologies of the semantic web, I will discuss what kind of background information can help the search engine return the correct results.
1. In the first kind of query the user is basically looking for information about a person, Frank van Harmelen. Some of the meta information (knowledge) that web pages can provide is:
“Frank van Harmelen” is professor. He teaches at Vrije University. He has publication FVH98. He works-on “Semantic Web”.
Also, the pages which referred to Frank as “he” can add the meta information that “he” means Frank van Harmelen. The search engine then won’t miss the pages that did not refer to him by his complete name.
Also, if the search engine has some background information about the user who is typing the query (with Google/ig or My Yahoo, users can have a personalized search page), the user need not type “Frank van Harmelen Vrije university”; “Frank Van Harmelen” should be sufficient.
Based on the user’s profile, past web history, online bookmarks and homepage, the search engine can infer that the user is interested in the Semantic Web and that this is why he has typed “Frank Van Harmelen”. The search engine already knows that Frank van Harmelen works on the Semantic Web, so the background information has closed the knowledge gap that previously existed.
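As a rough sketch of this idea, the snippet below ranks people with the same name by how well their “works-on” metadata overlaps with the interests in a user profile. The record layout, the ids, the second (purely hypothetical) Frank van Harmelen and the function name search_person are all assumptions for illustration, not an actual semantic web API.

```python
# Invented metadata records; the first person is purely hypothetical and is
# listed first on purpose, so that the ranking has to do some work.
PEOPLE = [
    {"id": "fvh-other", "name": "Frank van Harmelen", "occupation": "baker",
     "affiliation": None, "works_on": {"Bread"}},
    {"id": "fvh-prof", "name": "Frank van Harmelen", "occupation": "professor",
     "affiliation": "Vrije University", "works_on": {"Semantic Web"}},
]


def search_person(name, user_interests):
    """Find people by name, best match for the user's interests first."""
    matches = [p for p in PEOPLE if p["name"].lower() == name.lower()]
    return sorted(
        matches,
        key=lambda p: len(p["works_on"] & user_interests),
        reverse=True,
    )


# Interests that a search engine might have inferred from the user's profile.
ranked = search_person("Frank Van Harmelen", {"Semantic Web", "Ontologies"})
print(ranked[0]["occupation"], ranked[0]["affiliation"])
```

The user typed only the name, yet the professor comes out on top because the profile and the page metadata share the topic “Semantic Web”.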
Indiana Jones “is-a” Movie
Harrison Ford “acted-in” Indiana Jones
Steven Spielberg “directed” Indiana Jones
So the search engine knows that Harrison Ford and Steven Spielberg are both related to Indiana Jones, which is a movie, and it can return the result “Indiana Jones”. In this case the search engine can simply return the name of the movie rather than a complete web page. This is similar to typing “1USD to INR” on the Google search page: the result is the value 45.98. Google evaluates the query expression automatically and displays the value, much like a calculator application.
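The statements above can be treated as (subject, predicate, object) triples and queried directly. The following is a minimal sketch with a plain Python list standing in for a real triple store; the helper names and the extra triples about Blade Runner and Jaws are my additions, included only so that the query has something to filter out.

```python
# The first three triples come from the text; the rest are added for illustration.
TRIPLES = [
    ("Indiana Jones", "is-a", "Movie"),
    ("Harrison Ford", "acted-in", "Indiana Jones"),
    ("Steven Spielberg", "directed", "Indiana Jones"),
    ("Blade Runner", "is-a", "Movie"),
    ("Harrison Ford", "acted-in", "Blade Runner"),
    ("Jaws", "is-a", "Movie"),
    ("Steven Spielberg", "directed", "Jaws"),
]


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is a known triple."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def subjects(predicate, obj):
    """All subjects s such that (s, predicate, obj) is a known triple."""
    return {s for s, p, o in TRIPLES if p == predicate and o == obj}


def movies_by_both(actor, director):
    """Movies in which `actor` acted and which `director` directed."""
    movies = subjects("is-a", "Movie")
    return sorted(objects(actor, "acted-in") & objects(director, "directed") & movies)


answer = movies_by_both("Harrison Ford", "Steven Spielberg")
print(answer)
```

The manual procedure described earlier (two separate searches, then comparing the results by hand) becomes a single set intersection once the relationships are stated explicitly.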
The meta information will cover the title of the music, the artists, the type of music (sad, hip hop), the release date, etc. This meta information will allow users to make searches like
“Give me latest sad songs of XYZ artist”
If the user maintains an online playlist, or specifies music preferences on a homepage or social networking profile, the query can be made more powerful.
“Give me latest music”
Now the search engine knows both the user’s taste and the information about the music. Based on this, it can return the correct results.
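A small sketch of how such metadata could answer both queries is shown below. The song records, the field names and the USER_PROFILE dictionary are invented placeholders; in practice the preferences would come from the user’s playlist or profile.

```python
from datetime import date

# Invented song metadata: title, artist, mood, genre and release date.
SONGS = [
    {"title": "Song A", "artist": "XYZ", "mood": "sad", "genre": "pop",
     "released": date(2009, 5, 1)},
    {"title": "Song B", "artist": "XYZ", "mood": "happy", "genre": "pop",
     "released": date(2009, 6, 1)},
    {"title": "Song C", "artist": "ABC", "mood": "sad", "genre": "hip hop",
     "released": date(2009, 7, 1)},
    {"title": "Song D", "artist": "XYZ", "mood": "sad", "genre": "pop",
     "released": date(2009, 8, 1)},
]


def latest_songs(songs, limit=2, **criteria):
    """Titles of the newest songs whose metadata matches every criterion."""
    matching = [s for s in songs
                if all(s[field] == value for field, value in criteria.items())]
    matching.sort(key=lambda s: s["released"], reverse=True)
    return [s["title"] for s in matching[:limit]]


# "Give me latest sad songs of XYZ artist"
explicit = latest_songs(SONGS, artist="XYZ", mood="sad")

# "Give me latest music", with the taste taken from a hypothetical user profile.
USER_PROFILE = {"mood": "sad"}
personalized = latest_songs(SONGS, **USER_PROFILE)

print(explicit, personalized)
```

The second query contains no keywords about the type of music at all; the filter comes entirely from the profile.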
These days Amazon, eBay and Google Base provide a syntax that can be used by the stores (in the case of Amazon) or the sellers (eBay, Google Base) to describe their products. Based on this information, search engines can execute the query “All music players of at least 4GB capacity”.
Since the format specifies the meaning of the fields very clearly, the search engines can be taught about it. Even if the format of each store is different, we can write some kind of mapping between the fields. Amazon may call a product PRODUCT while eBay may call it ITEM.
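Such a mapping could look roughly like the sketch below. Only the PRODUCT and ITEM field names come from the text; the other field names, the listings and the function names are hypothetical and do not reflect the real formats of any store.

```python
# Per-store mapping from store-specific field names to common ones.
# Only PRODUCT and ITEM are taken from the text; the rest is assumed.
FIELD_MAPPINGS = {
    "amazon": {"PRODUCT": "name", "CAPACITY_GB": "capacity_gb",
               "CATEGORY": "category"},
    "ebay": {"ITEM": "name", "SIZE_GB": "capacity_gb", "TYPE": "category"},
}

# Invented listings, each in the vocabulary of its own store.
LISTINGS = [
    ("amazon", {"PRODUCT": "iPod nano", "CAPACITY_GB": 8,
                "CATEGORY": "music player"}),
    ("ebay", {"ITEM": "Generic MP3 player", "SIZE_GB": 2,
              "TYPE": "music player"}),
    ("ebay", {"ITEM": "USB stick", "SIZE_GB": 4, "TYPE": "storage"}),
]


def normalize(store, record):
    """Rename a store-specific record into the common field names."""
    mapping = FIELD_MAPPINGS[store]
    return {mapping[field]: value for field, value in record.items()}


def music_players_with_capacity(min_gb):
    """Names of music players with at least `min_gb` of capacity."""
    normalized = [normalize(store, record) for store, record in LISTINGS]
    return [r["name"] for r in normalized
            if r["category"] == "music player" and r["capacity_gb"] >= min_gb]


players = music_players_with_capacity(4)
print(players)
```

Because the capacity is now a number with a known meaning, the “at least 4GB” condition becomes an ordinary comparison, and the 4GB USB stick is excluded because its category is known.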
I have just scratched the surface of the semantic web. It’s a vast research field with tremendous potential to improve search capabilities. The semantic web will certainly change the way queries are written in the near future.