Challenges in Web Search Engines by Monika Henzinger, Rajeev Motwani, and Craig Silverstein
How Things Work: Web Search Engines: Part 1 by David Hawking
How Things Work: Web Search Engines: Part 2 by David Hawking
Both the Hawking articles have been very informative. What I've learned:
- indexing the web is not impossible, although it still seems like it should be to my uneducated mind
- Crawling and indexing take terabytes of storage and require very dependable servers
- Speed, politeness, excluded content, and duplicate content must all be accounted for when crawling
- Decoding the many document formats and character encodings is a big hurdle that crawlers have to deal with
- Postings lists are essential for a document or web page to be findable: each term's postings list records the IDs of the pages containing it, so without an entry there the page's ID can't be retrieved for that term
- Scanning and inversion make up the two main parts of indexing a page (I sketched a toy version of this after the list)
- Scale: crawling and indexing at web size is itself a big challenge
- Searching the terms and words used in web pages is a daunting task because there is no real way to know whether something is a spelling error or a legitimate non-dictionary word. Example: "teh"
- Anchor text (the clickable words of a link) helps link pages together and describes the page it points to
- Queries shape a large part of how and why web pages are indexed: they show how often a page is searched for and what terms are used to search for it
- Query processors need to be advanced to yield good results
- Caching: I am hesitant to believe that caching really helps provide good results for a query. Just because a page contains the word doesn't mean it uses it in the context the searcher is looking for
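To make the postings list / scanning / inversion ideas concrete for myself, here is a tiny sketch in Python. This is just my own toy illustration (the page texts, function names, and AND-style lookup are all things I made up), not code from the Hawking articles and nothing like what a real engine runs at web scale:

```python
# Toy inverted index: "scanning" tokenizes each page, "inversion" builds
# the postings lists that map each term to the IDs of pages containing it.
from collections import defaultdict
import re

pages = {
    1: "Web search engines crawl and index the web",
    2: "Crawling the web takes terabytes of storage",
    3: "Search engines answer queries with postings lists",
}

def scan(text):
    """Scanning: break a page into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def invert(pages):
    """Inversion: build a postings list (sorted page IDs) for every term."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in scan(text):
            index[term].add(page_id)
    return {term: sorted(ids) for term, ids in index.items()}

index = invert(pages)

def lookup(index, query):
    """Return IDs of pages containing every query term (AND semantics)."""
    postings = [set(index.get(term, [])) for term in scan(query)]
    return sorted(set.intersection(*postings)) if postings else []

print(index["web"])                 # [1, 2] - the postings list for "web"
print(lookup(index, "web search"))  # [1]
```

Even this toy version shows why scale is the hard part: a real engine has to do the same thing for billions of pages and keep the postings lists compact and fast to intersect.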
Challenges in Web Search Engines
- The idea of spam on the web is addressed first, and I think that's because it represents such a real problem: people don't often look past the first page of results, and if that page is loaded with spam, the search has essentially yielded no results for the patron
- Both text spam and link spam try to manipulate search results by stuffing a page with the keywords people would use to find it and adding as many as possible, often hidden from view or dumped in a link farm at the bottom of the page
- Cloaking (showing one version of a page to the search engine's crawler and a different one to human visitors) is used to deceive search engines like Google, but I don't really understand how it is pulled off technically
- Quality should be based on more than just how often a certain word appears on a page (see the toy score below), but it seems impossible to employ human indexers to help differentiate these kinds of things. One possible way to do it: have users give feedback
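To convince myself why counting a word isn't enough, I sketched the crudest possible scorer (again my own toy Python with made-up page text, not anything from the Henzinger paper): ranking purely by how often the query term appears lets a keyword-stuffed page beat a genuinely useful one.

```python
# Toy illustration: scoring pages by raw term frequency rewards
# keyword stuffing over genuinely useful pages.
def term_frequency_score(text, term):
    """Fraction of the page's words that are the query term."""
    words = text.lower().split()
    return words.count(term.lower()) / len(words) if words else 0.0

honest_page = "The library has a large collection of books on web search and indexing."
spam_page = "books books books cheap books best books buy books books now"

print(term_frequency_score(honest_page, "books"))  # ~0.08
print(term_frequency_score(spam_page, "books"))    # ~0.64 - the spam page "wins"
```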
Overall it seems that what I learned in my Indexing class here in the Information Science school can only loosely be related to web indexing. Searching the web is a complicated process that relies mostly on machine-run indexing methods and does not involve a human who can make the judgments only humans can make, such as how words are perceived by patrons and which words on a page are actually relevant.
Web search engines can only operate in a particular way and leave the rest up to the user, but if the user is uneducated about how search engines work, it will be hard for him or her to get the best results. Patience is key.
Muddiest Point for Last Week:
Are there written requirements for the proposal on Courseweb?
Where are the written requirements for Assignment 3 located on Courseweb?