Friday, October 10, 2008

Week 7 Readings

I'll be commenting on:
Challenges in Web Search Engines by Monika Henzinger, Rajeev Motwani, and Craig Silverstein
How Things Work: Web Search Engines: Part 1 by David Hawking
How Things Work: Web Search Engines: Part 2 by David Hawking

Both the Hawking articles have been very informative. What I've learned:

  • Indexing the web is not impossible, although to my uneducated mind it still seems like it should be
  • Crawling and indexing consume terabytes of storage and require very dependable servers
  • Speed, politeness, and content (both excluded and duplicate) must be accounted for when crawling (see the sketch after this list)
  • Decoding the many document formats and character encodings found on the web is a big hurdle crawlers have to deal with
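To make the politeness point concrete, here is a minimal sketch of how a crawler might behave politely: it checks the site's robots.txt for excluded content and waits between requests to the same host. The seed URL, the two-page frontier, and the two-second delay are all made up for the example.

```python
import time
import urllib.robotparser
from urllib.parse import urlparse
from urllib.request import urlopen

SEED = "https://example.com/"      # hypothetical starting site
DELAY_SECONDS = 2                  # assumed per-host courtesy delay

parsed = urlparse(SEED)
robots = urllib.robotparser.RobotFileParser()
robots.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
robots.read()                      # fetch the site's crawling rules

for path in ["/", "/about"]:       # stand-in crawl frontier
    url = f"{parsed.scheme}://{parsed.netloc}{path}"
    if not robots.can_fetch("*", url):
        continue                   # excluded content: obey robots.txt
    page = urlopen(url).read()     # fetch the page (no error handling)
    print(url, len(page), "bytes")
    time.sleep(DELAY_SECONDS)      # politeness: don't hammer the server
```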
Part 2
  • Postings lists are essential for a document or web page to be indexed; without an entry in a term's postings list, the page's ID can't be found for that term (see the sketch after this list)
  • Scanning and inversion make up the two main phases of indexing a page
  • Scale: web-sized indexing demands a huge amount of machinery, which is a big challenge in itself
  • Handling the terms and words used in webpages is a daunting task because there is no real way to know whether a string like "teh" is a spelling error or a nondictionary word
  • Anchor text (the clickable words in a link) helps describe the page being linked to
  • Query logs play a large part in how and why webpages are indexed: they show how often a page is searched for and what terms are used to search for it
  • Query processors need to be sophisticated to yield good results
  • Caching: I am hesitant to believe that caching will really help provide good results for a query. Just because a page contains the word doesn't mean it is used in the context the searcher is looking for
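Here is a toy sketch of the scanning and inversion idea: the loop "scans" each document's words, and the resulting inverted index maps every term to a postings list of document IDs, which a simple query lookup then intersects. The two sample documents are invented for illustration.

```python
from collections import defaultdict

docs = {
    1: "web search engines crawl the web",
    2: "search engines build an index",
}

index = defaultdict(set)              # term -> postings list of doc IDs
for doc_id, text in docs.items():     # scanning: read each document's words
    for term in text.lower().split():
        index[term].add(doc_id)       # inversion: record term -> document

def search(query):
    """Return IDs of documents containing every query term."""
    postings = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

print(sorted(search("search engines")))  # -> [1, 2]
print(sorted(search("web")))             # -> [1]
```

Caching, as I questioned above, would just mean storing the output of search() for popular query strings so the intersection doesn't have to be recomputed. It speeds things up, but it doesn't change which results come back.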

Challenges in Web Search Engines
  • The idea of spam on the web is addressed first, and I think that's because it represents such a real problem: people don't often look past the first page of results, and if that page is loaded with spam, then the search has essentially yielded no results for the patron
  • Both text spam and link spam try to manipulate a search engine by adding as many of the keywords that would be used to find the page as possible, often hidden from view or stacked in a link farm at the bottom of the page (see the sketch after this list)
  • Cloaking is used to deceive search engines like Google by serving the crawler a different version of a page than the one human visitors see, though I don't really understand how it works in practice
  • Quality should be based on more than just the amount of a certain word on a page, but it seems impossible for human indexers to be employed to differentiate these kinds of things at web scale. One way to do it: have users give feedback.
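To show why raw keyword counts are so easy to game, here is a crude sketch of a keyword-density check, the kind of signal text spam exploits. The sample pages and the 20% threshold are invented for the example.

```python
def keyword_density(text, keyword):
    """Fraction of words on the page that are the given keyword."""
    words = text.lower().split()
    return words.count(keyword.lower()) / len(words)

honest = "our travel shop sells cheap flights to many cities"
stuffed = "cheap flights cheap flights cheap flights book cheap now"

for page in (honest, stuffed):
    density = keyword_density(page, "cheap")
    verdict = "possible text spam" if density > 0.20 else "looks normal"
    print(f"{density:.0%} cheap -> {verdict}")
```

Of course, a spammer who knows the threshold can stuff just under it, which is part of why quality signals beyond word counts, like the user feedback mentioned above, matter.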

Overall, it seems that what I learned in my Indexing class here in the Information Science school relates only loosely to web indexing. Searching the web is a complicated process that relies mostly on machine-run indexing and leaves out the human who can determine the things only humans can determine, such as how words are perceived by patrons and which words on a page are actually relevant.

Web search engines can only operate in a particular way, leaving the guidance to the user, and if the user is uneducated about how search engines work, it will be hard for him/her to get the best results. Patience is key.

Muddiest Point for Last Week:

Are there written requirements for the proposal on Courseweb?
Where are the written requirements for Assignment 3 located on Courseweb?
