Friday, October 10, 2008

Week 7 Readings

I'll be commenting on:
Challenges in Web Search Engines by Monika Henzinger, Rajeev Motwani, and Craig Silverstein
How Things Work: Web Search Engines: Part 1 by David Hawking
How Things Work: Web Search Engines: Part 2 by David Hawking

Both the Hawking articles have been very informative. What I've learned:

  • Indexing the web is not impossible, although to my uneducated mind it still seems like it should be
  • Crawling and indexing consume terabytes of storage and require very dependable servers
  • Speed, politeness, and content (both excluded and duplicate) must be accounted for when crawling (see the sketch after this list)
  • Decoding the many document formats and character encodings found on the web is a big hurdle crawlers have to deal with
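To make the politeness point concrete, here is a minimal sketch of how a crawler might behave politely: it checks the site's robots.txt for excluded content and waits between requests to the same host. The seed URL, the two-page frontier, and the two-second delay are all made up for the example.

```python
import time
import urllib.robotparser
from urllib.parse import urlparse
from urllib.request import urlopen

SEED = "https://example.com/"      # hypothetical starting site
DELAY_SECONDS = 2                  # assumed per-host courtesy delay

parsed = urlparse(SEED)
robots = urllib.robotparser.RobotFileParser()
robots.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
robots.read()                      # fetch the site's crawling rules

for path in ["/", "/about"]:       # stand-in crawl frontier
    url = f"{parsed.scheme}://{parsed.netloc}{path}"
    if not robots.can_fetch("*", url):
        continue                   # excluded content: obey robots.txt
    page = urlopen(url).read()     # fetch the page (no error handling)
    print(url, len(page), "bytes")
    time.sleep(DELAY_SECONDS)      # politeness: don't hammer the server
```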
Part 2
  • Postings lists are essential for a document or web page to be indexed; without an entry in a term's postings list, the page's ID can't be found for that term (see the sketch after this list)
  • Scanning and inversion make up the two main phases of indexing a page
  • Scale: web-sized indexing demands a huge amount of machinery, which is a big challenge in itself
  • Handling the terms and words used in webpages is a daunting task because there is no real way to know whether a string like "teh" is a spelling error or a nondictionary word
  • Anchor text (the clickable words in a link) helps describe the page being linked to
  • Query logs play a large part in how and why webpages are indexed: they show how often a page is searched for and what terms are used to search for it
  • Query processors need to be sophisticated to yield good results
  • Caching: I am hesitant to believe that caching will really help provide good results for a query. Just because a page contains the word doesn't mean it is used in the context the searcher is looking for
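Here is a toy sketch of the scanning and inversion idea: the loop "scans" each document's words, and the resulting inverted index maps every term to a postings list of document IDs, which a simple query lookup then intersects. The two sample documents are invented for illustration.

```python
from collections import defaultdict

docs = {
    1: "web search engines crawl the web",
    2: "search engines build an index",
}

index = defaultdict(set)              # term -> postings list of doc IDs
for doc_id, text in docs.items():     # scanning: read each document's words
    for term in text.lower().split():
        index[term].add(doc_id)       # inversion: record term -> document

def search(query):
    """Return IDs of documents containing every query term."""
    postings = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

print(sorted(search("search engines")))  # -> [1, 2]
print(sorted(search("web")))             # -> [1]
```

Caching, as I questioned above, would just mean storing the output of search() for popular query strings so the intersection doesn't have to be recomputed. It speeds things up, but it doesn't change which results come back.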

Challenges in Web Search Engines
  • The idea of spam on the web is addressed first, and I think that's because it represents such a real problem: people don't often look past the first page of results, and if that page is loaded with spam, then the search has essentially yielded no results for the patron
  • Both text spam and link spam try to manipulate a search engine by adding as many of the keywords that would be used to find the page as possible, often hidden from view or stacked in a link farm at the bottom of the page (see the sketch after this list)
  • Cloaking is used to deceive search engines like Google by serving the crawler a different version of a page than the one human visitors see, though I don't really understand how it works in practice
  • Quality should be based on more than just the amount of a certain word on a page, but it seems impossible for human indexers to be employed to differentiate these kinds of things at web scale. One way to do it: have users give feedback.
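To show why raw keyword counts are so easy to game, here is a crude sketch of a keyword-density check, the kind of signal text spam exploits. The sample pages and the 20% threshold are invented for the example.

```python
def keyword_density(text, keyword):
    """Fraction of words on the page that are the given keyword."""
    words = text.lower().split()
    return words.count(keyword.lower()) / len(words)

honest = "our travel shop sells cheap flights to many cities"
stuffed = "cheap flights cheap flights cheap flights book cheap now"

for page in (honest, stuffed):
    density = keyword_density(page, "cheap")
    verdict = "possible text spam" if density > 0.20 else "looks normal"
    print(f"{density:.0%} cheap -> {verdict}")
```

Of course, a spammer who knows the threshold can stuff just under it, which is part of why quality signals beyond word counts, like the user feedback mentioned above, matter.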

Overall, it seems that what I learned in my Indexing class here in the Information Science school relates only loosely to web indexing. Searching the web is a complicated process that relies mostly on machine-run indexing and leaves out the human who can determine the things only humans can determine, such as how words are perceived by patrons and which words on a page are actually relevant.

Web search engines can only operate in a particular way, leaving the guidance to the user, and if the user is uneducated about how search engines work, it will be hard for him/her to get the best results. Patience is key.

Muddiest Point for Last Week:

Are there written requirements for the proposal on Courseweb?
Where are the written requirements for Assignment 3 located on Courseweb?
