Pandia Search Central
Pandia Post Newsletter No. 11 2001 Part 5

BOOKS

The Invisible Web
According to search engine experts Chris Sherman and Gary Price, the Invisible Web consists of material that general-purpose search engines either cannot or will not include in their collections of Web pages. In their recent book The Invisible Web they define four types of invisibility:

The Opaque Web consists of files that can be, but are not, included in search engine indexes. Some search engines limit the number of webpages they include from each site, as this kind of "crawling" demands a lot of resources. Then there is the frequency of the crawl: there might be new pages that have not yet been included by the search engine. Some pages cannot be found because there are no links to them from other webpages, or because the webmaster has failed to inform the search engine about their existence.

The Private Web consists of webpages that are available but that have been deliberately excluded by the webmasters themselves. They may be password protected, or the webmaster might have included a "noindex" metatag or used a so-called robots.txt file to instruct the search engine robot to skip the page.

The Proprietary Web consists of pages that are only accessible to people who have registered to view them. Search engine robots cannot fill in a form, so there is no way they can get into restricted portions of a site.

The Truly Invisible Web consists of content that cannot be indexed for technical reasons. The documents may be in a file format that is not recognized by the search engine robot. Until quite recently most traditional search engines indexed regular webpages only (i.e. HTML-based documents). Google now includes Acrobat PDF, PostScript and Microsoft Office files, but most search engines do not. Then there are the dynamically generated webpages, i.e. webpages that are generated on the fly by a script that queries a database. Search engines will normally (but not always) avoid such pages for fear of being trapped in an endless loop. Moreover, they cannot fill in a form, so if the site requires you to fill in a form to get access to information, the search engine will not find it.

As Sherman and Price will tell you, the web addresses of dynamically generated pages often include special characters like ? and &. For instance,
http://www.pandia.com/index.html
is a regular static webpage, whereas
http://www.pandia.com/cgi-local/meta.pl?search=%22Gary+Price%22
is generated on the fly by a script.

All this means that there is a lot of valuable information out there that cannot be found using the traditional search engines. And that is what this book is about. Not only do Sherman and Price give an excellent introduction to the concept of the Invisible Web, they also tell you how to access this part of the Internet.

The second half of the book is a well-described catalog of portals that present Invisible Web content, as well as a directory of more than 1000 selected Invisible Web sites. The selection of directory sites is a bit puzzling, as some categories are well represented and others are not, but the directory provides a lot of useful information just the same. The authors themselves say that because the Invisible Web is so huge and constantly changing, creating a totally comprehensive directory is virtually impossible. Their goal was to go for quality over quantity, though they continue to add new resources as they find them.

The problem with printing Web resource catalogs in books is that they become outdated very fast. That is why the authors have decided to publish an updated version of the directory on the Web, at http://www.invisible-web.net/.

This book has been criticized for including too much general information on search engines and Web searching, and if you are the busy type who sticks to the executive summary, this book is probably not for you (although you can always skip the first chapters). We enjoyed the historical and technical introduction to Web searching very much, though. It makes this book a useful introduction to Internet searching in general.

Note that the Pandia Powersearch All-in-One search page also has a section on Invisible Web resources:
http://www.pandia.com/powersearch/index.html#specialized

This book can be bought from Amazon.com.

How to Search the Web

There is no way we can give you the objective truth about this book, as we have written it ourselves. It is the first in a series of three ebooks on search engines published by the Intellectua ebook company. How to Search the Web is a short and concise "three minute" tutor on efficient Internet searching, a bit similar to our Goalgetter Web Search Tutorial. Unlike the Goalgetter tutorial, however, this guide is published in the popular Acrobat PDF format, meaning that you can print it in an easy-to-read format and read it in bed if you want to. All Web addresses are clickable, but they are also given in full, so that you can use your paper copy as a source of URLs. The ebook covers all the major search engines and directories, and gives an easy-to-understand introduction to more advanced Internet searching.

Read more about this ebook at:
http://www.dirtsmart.com/titles/3mt0010.html?10389

FINALLY...

Do you like Pandia? Feel free to forward this newsletter to a friend. To recommend the Pandia site to a friend, go to:
http://www.recommend-it.com/l.z.e?s=328530

Go to http://www.pandia.com/post/ to find information on how to subscribe and unsubscribe.

The Pandia Post is edited by Per and Susanne Koch.