Real time indexing in Google
Real time indexing has become the new holy grail in the search engine business. Pandia takes a look at how Google is doing.
The Twitter effect
It is Twitter that is the culprit. If you follow the debate taking place there on a specific topic, you may get the news as soon as it happens.
Google (NSDQ:GOOG) has to compete with that, but the traditional web search engines were not really built for that kind of immediate indexing.
First of all, it takes time for the search engines to visit all the existing web sites (Google now has some 1 trillion web pages in its index). If you wanted your search engine index to stay perfectly fresh, you would have to revisit all those pages once every, well, I don’t know, every five minutes, maybe?
That kind of indexing traffic would cause the Internet to kneel and webmasters to despair. It is not feasible.
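To get a rough sense of the scale, here is a back-of-the-envelope calculation using the two loose figures mentioned above: some 1 trillion pages and a five-minute revisit interval. Both numbers are estimates, and the function name is only illustrative.

```python
def required_fetch_rate(total_pages: int, revisit_interval_seconds: float) -> float:
    """Average number of page fetches per second needed to revisit
    every page once per interval."""
    if revisit_interval_seconds <= 0:
        raise ValueError("revisit interval must be positive")
    return total_pages / revisit_interval_seconds


if __name__ == "__main__":
    pages = 1_000_000_000_000   # "some 1 trillion web pages" (rough figure)
    interval = 5 * 60           # five minutes, expressed in seconds
    rate = required_fetch_rate(pages, interval)
    print(f"{rate:,.0f} fetches per second")
```

Under these assumptions the crawlers would need to fetch roughly 3.3 billion pages every second, around the clock.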
Google makes a compromise. How often Google revisits a site depends on its popularity, its authority (i.e. to what extent other “good” sites link to it) and how frequently it is updated.
The large news sites and popular blogs will be revisited several times per hour — yes, as often as every five minutes if needed. Your grandmother’s site on cute kittens will have to wait a little longer for Google’s crawlers.
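Google does not publish how it weighs these signals, so the following is only a toy sketch of the idea: the three scores, their equal weighting, the one-month upper bound and all names are assumptions made for illustration. The only figure taken from the text is the five-minute minimum.

```python
from dataclasses import dataclass

MIN_INTERVAL_MINUTES = 5                 # "as often as every five minutes"
MAX_INTERVAL_MINUTES = 30 * 24 * 60      # assumed upper bound: about a month


@dataclass
class Site:
    name: str
    popularity: float        # 0.0 .. 1.0, hypothetical score
    authority: float         # 0.0 .. 1.0, hypothetical link-based score
    update_frequency: float  # 0.0 .. 1.0, how often the content changes


def revisit_interval_minutes(site: Site) -> float:
    """Map the three signals to a crawl interval; higher scores mean shorter waits."""
    for value in (site.popularity, site.authority, site.update_frequency):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores must be between 0 and 1")
    # Equal weighting is an arbitrary choice for this illustration.
    score = (site.popularity + site.authority + site.update_frequency) / 3
    # Interpolate on a log scale between the maximum and minimum interval.
    ratio = MAX_INTERVAL_MINUTES / MIN_INTERVAL_MINUTES
    return MAX_INTERVAL_MINUTES / (ratio ** score)


if __name__ == "__main__":
    news = Site("large news site", 1.0, 1.0, 1.0)
    kittens = Site("grandmother's kitten site", 0.1, 0.1, 0.2)
    for site in (news, kittens):
        print(site.name, round(revisit_interval_minutes(site)), "minutes")
```

A site with top scores on all three signals lands on the five-minute minimum, while a rarely updated, rarely linked site drifts toward the assumed maximum.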
Universal Search
The regular web spiders are not necessarily what makes the results this fresh.
The search result pages you see will contain a mix of search results, some from traditional web search, some from news search and some from blog search, all powered by different indexing systems.
Google News has its own index, where the core consists of data from high authority sites that are visited by Google’s news search spiders.
Google Blog Search relies on the RSS feeds of blogs. Google first uses the information in the RSS file, then visits the blog to fetch the full text of the post. Most blog platforms will “ping” Google automatically every time a new post is added. Normally Google will have indexed that post within 10 minutes.
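The first half of that process, reading a feed to find out which posts to fetch, can be sketched with Python's standard library. This is not Google's code: the sample feed, the example.com URLs and the function name are made up for illustration, and a real crawler would go on to download each link.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 feed, standing in for what a blog would publish.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://blog.example.com/</link>
    <item>
      <title>First post</title>
      <link>http://blog.example.com/first-post</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://blog.example.com/second-post</link>
    </item>
  </channel>
</rss>"""


def posts_to_fetch(feed_xml: str) -> list:
    """Return the title and link of every feed item that has a link."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        link = item.findtext("link", default="").strip()
        if not link:
            continue  # nothing to visit for this item
        title = item.findtext("title", default="").strip()
        posts.append({"title": title, "link": link})
    return posts


if __name__ == "__main__":
    for post in posts_to_fetch(SAMPLE_FEED):
        print(post["title"], "->", post["link"])
```

The feed only tells the crawler that a post exists and where it lives; the full text still has to be fetched from the blog itself, as described above.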
Not indexing Twitter tweets
Google is not indexing regular Twitter tweets on a real time basis (although Twitter account pages may be included in search results).
The fact that Twitter may be planning to expand the database of its own search engine to include not only the text of the tweets themselves, but also the content of the pages they link to, must be giving Google cause for concern.
By doing this Twitter will get a very powerful real time search engine, indeed.
How to see the freshest content
Google Operating System has a post that tells you how to find the pages Google has indexed during the last minute.
Here’s how you do it:
1. Do a search on Google, let’s say for “search engine marketing”
2. Click on “Show options”
3. Click on “Past 24 hours” (in the left hand margin)
4. Edit the URL in the URL field of your browser by replacing “tbs=qdr:d” with “tbs=qdr:n”
5. Hit Enter
As Google OS points out, the date restriction feature is quite flexible.
This is the syntax used by Google’s URLs:
tbs=qdr:[name][value]
where [name] can be one of these values: s (second), n (minute), h (hour), d (day), w (week), m (month), y (year), while [value] is a relevant number.
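As a small sketch, the syntax above can be turned into a helper that builds such a URL. The base address and the q parameter are assumptions on my part, the function name is made up, and Google may of course change or drop the tbs parameter at any time.

```python
from urllib.parse import urlencode

# Unit letters as listed above.
UNITS = {"s": "second", "n": "minute", "h": "hour", "d": "day",
         "w": "week", "m": "month", "y": "year"}

BASE_URL = "https://www.google.com/search"  # assumed search endpoint


def recent_search_url(query: str, unit: str, value: int = None) -> str:
    """Build a search URL restricted to the past `value` units of time.

    Leaving `value` out gives a single unit, e.g. "qdr:n" for the last minute.
    """
    if unit not in UNITS:
        raise ValueError(f"unknown unit {unit!r}; expected one of {sorted(UNITS)}")
    if value is not None and value < 1:
        raise ValueError("value must be a positive number")
    restriction = f"qdr:{unit}{'' if value is None else value}"
    # Keep the colon readable so the result matches the "tbs=qdr:n" form.
    return BASE_URL + "?" + urlencode({"q": query, "tbs": restriction}, safe=":")


if __name__ == "__main__":
    print(recent_search_url("search engine marketing", "n"))
    print(recent_search_url("search engine marketing", "h", 6))
```

The first call corresponds to the manual steps above (pages from the last minute); the second asks for the past six hours.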
photo credit: Pink Sherbet Photography