Real time indexing in Google
Real time indexing has become the new holy grail in the search engine business. Pandia takes a look at how Google is doing.
The Twitter effect
It is Twitter that is the culprit. If you follow the debate that is taking place there on a specific topic, you may get the news as soon as it happens.
Google (NSDQ:GOOG) has to compete with that, but the traditional web search engines were not really built for that kind of immediate indexing.
First of all it takes time for the search engines to visit all the existing web sites (Google now has some 1 trillion web pages in its index), and if you want your search engine index to stay perfectly fresh, you would have to revisit all those pages once every — well, I don’t know — every five minutes, maybe?
That kind of indexing traffic would cause the Internet to kneel and webmasters to despair. It is not feasible.
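To see why, here is a rough back-of-the-envelope calculation. It uses only the two figures mentioned above (about 1 trillion pages and a five-minute refresh) and assumes, for simplicity, one fetch per page per interval. It is an illustration of scale, not a description of how Google's crawlers actually work.

```python
# Back-of-the-envelope estimate. Both figures come from the article text,
# not from any published Google number on crawl rates.
PAGES_IN_INDEX = 1_000_000_000_000   # "some 1 trillion web pages"
REVISIT_INTERVAL_SECONDS = 5 * 60    # the hypothetical five-minute refresh


def required_fetches_per_second(pages: int, interval_seconds: int) -> float:
    """Average fetch rate needed to revisit every page once per interval."""
    return pages / interval_seconds


rate = required_fetches_per_second(PAGES_IN_INDEX, REVISIT_INTERVAL_SECONDS)
print(f"{rate:,.0f} page fetches per second")
```

Under these assumptions the crawler would need to sustain roughly 3.3 billion page fetches every second, around the clock.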
Google makes a compromise. How often Google revisits a site is dependent on its popularity, authority (i.e. to what extent other “good” sites link to it) and its updating frequency.
The large news sites and popular blogs will be revisited several times per hour — yes, as often as every five minutes if needed. Your grandmother’s site on cute kittens will have to wait a little longer for Google’s crawlers.
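The compromise can be pictured as a scoring function that turns popularity, authority and updating frequency into a revisit interval. The sketch below is purely illustrative: the `Site` class, the 0-to-1 scores, the equal weighting and the one-week ceiling are all assumptions made for this example, and Google has not published its actual formula.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    popularity: float       # hypothetical score between 0.0 and 1.0
    authority: float        # hypothetical link-based score between 0.0 and 1.0
    updates_per_day: float  # how often the site publishes new content


MIN_INTERVAL_MIN = 5              # "as often as every five minutes"
MAX_INTERVAL_MIN = 7 * 24 * 60    # assumed ceiling of one week


def revisit_interval_minutes(site: Site) -> float:
    """Map the three signals to a revisit interval (illustrative only)."""
    # Treat one update per hour or more as the maximum update score.
    update_score = min(site.updates_per_day / 24.0, 1.0)
    # Equal weighting is an assumption, not Google's method.
    score = (site.popularity + site.authority + update_score) / 3.0
    # Linear interpolation: score 1.0 gives the minimum interval,
    # score 0.0 gives the maximum interval.
    return MAX_INTERVAL_MIN - score * (MAX_INTERVAL_MIN - MIN_INTERVAL_MIN)


sites = [
    Site("grandmas-kittens.example", popularity=0.0, authority=0.0, updates_per_day=0.0),
    Site("big-news.example", popularity=1.0, authority=1.0, updates_per_day=48.0),
]

# Crawl the sites with the shortest revisit interval first.
for site in sorted(sites, key=revisit_interval_minutes):
    print(site.name, revisit_interval_minutes(site))
```

With these made-up scores the news site lands at the five-minute floor, while the kitten site waits the full week.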
Universal Search
It is not necessarily the regular web spiders that cause this recency.
The search result pages you see will contain a mix of search results, some from traditional web search, some from news search and some from blog search, all powered by different indexing systems.
Google News has its own index, where the core consists of data from high authority sites that are visited by Google’s news search spiders.
Google Blog Search relies on the RSS feeds of blogs. First they make use of info from the RSS-file, then they visit the blog to fetch all the text contained in a blog post. Most blog platforms will “ping” Google automatically every time a new post is added. Normally Google will have indexed that post within 10 minutes.
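The first of those two steps, reading post information out of an RSS file, might look something like the sketch below. The sample feed, the `blog.example.com` URLs and the `posts_from_feed` helper are invented for illustration. The second step, fetching each linked page for its full text, is left out so the example runs without network access.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written RSS 2.0 feed used only for this example.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>http://blog.example.com/first-post</link>
      <description>Short summary from the feed.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://blog.example.com/second-post</link>
      <description>Another summary.</description>
    </item>
  </channel>
</rss>"""


def posts_from_feed(feed_xml: str) -> list:
    """Return title, link and summary for every <item> in an RSS feed."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "summary": item.findtext("description", default=""),
        })
    return posts


# The links are what a crawler would visit next to fetch the full post text.
for post in posts_from_feed(SAMPLE_FEED):
    print(post["title"], "->", post["link"])
```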
Not indexing Twitter tweets
Google is not indexing regular Twitter tweets on a real time basis (although Twitter account pages may be included in search results).
The fact that Twitter may be planning to expand the database of its own search engine to include not only the text of the tweets themselves, but also the content of the pages they link to, must be giving Google cause for concern.
By doing this Twitter will get a very powerful real time search engine, indeed.
How to see the freshest content
Google Operating System has a post that tells you how to find the pages Google has indexed during the last minute.
Here’s how you do it:
1. Do a search on Google, let’s say for “search engine marketing”
2. Click on “Show options”
3. Click on “Past 24 hours” (in the left hand margin)
4. Edit the URL in the URL field of your browser by replacing “tbs=qdr:d” with “tbs=qdr:n”
5. Hit Enter
As Google OS points out, the date restriction feature is quite flexible.
This is the syntax used by Google’s URLs:
tbs=qdr:[name][value]
where [name] can be one of these values: s (second), n (minute), h (hour), d (day), w (week), m (month), y (year), while [value] is a relevant number.
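The manual URL edit can also be scripted. The sketch below builds the `tbs=qdr:` value from the syntax described above and swaps it into a search URL. The function names `qdr_value` and `with_date_restriction` are made up for this example, and leaving the number out when it is 1 is an assumption based on the `qdr:d` and `qdr:n` forms shown in the steps. This reflects the parameter as described in this article and is not an official Google API.

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse

# Unit letters as listed in the article.
UNITS = {"s": "second", "n": "minute", "h": "hour", "d": "day",
         "w": "week", "m": "month", "y": "year"}


def qdr_value(unit: str, count: int = 1) -> str:
    """Build the value for the tbs parameter, e.g. 'qdr:n' or 'qdr:h6'."""
    if unit not in UNITS:
        raise ValueError(f"unknown unit {unit!r}, expected one of {sorted(UNITS)}")
    if count < 1:
        raise ValueError("count must be a positive integer")
    # Assumption: a bare unit letter means "the past 1 <unit>".
    return f"qdr:{unit}" if count == 1 else f"qdr:{unit}{count}"


def with_date_restriction(url: str, unit: str, count: int = 1) -> str:
    """Return the URL with its tbs parameter set to the given restriction."""
    parts = urlparse(url)
    query = parse_qs(parts.query)
    query["tbs"] = [qdr_value(unit, count)]
    # Keep ':' unescaped so the result matches the form shown in the article.
    new_query = urlencode(query, doseq=True, safe=":")
    return urlunparse(parts._replace(query=new_query))


past_day = "http://www.google.com/search?q=search+engine+marketing&tbs=qdr:d"
print(with_date_restriction(past_day, "n"))      # past minute
print(with_date_restriction(past_day, "h", 6))   # past six hours
```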