Effective Internet Searching

Learning Objectives

Explain what constitutes effective Internet searching.
Recall that search engines consist of a crawler and a query processor, and describe their functions.
Utilize complex queries, employing logical operators such as AND, OR, and NOT, to efficiently locate information.
Demonstrate how to filter and subtract search terms to exclude irrelevant results.
Illustrate the application of the CRAAP method for evaluating the validity of retrieved information.
Analyze and cross-reference discovered information with additional sources to ensure accuracy.

Glossary Terms

It is crucial to read and internalize the detailed definitions for the following terms provided in the Glossary section of your textbook, as they offer a deeper understanding of the concepts and are frequently included in quizzes and tests:

AND-queries
Authoritative
Crawler
Index
Intersects
Logical operator
OR-query
PageRank
Primary source
Query processor
Search engine
Secondary source
Site search
Tertiary source
Tokens
Unmediated

Fundamentals of the Search Engine

To effectively find information on the World Wide Web, users rely on search engines, which are collections of sophisticated computer programs.
Crucially, the vast amount of information posted on the Web is inherently unorganized and unmediated by any central authority.

Search Engine Operation

Search engines systematically traverse the Web to identify available information and then organize it for retrieval.

Crawling and Index Building

The initial function of a search engine is crawling, where it visits every accessible web page.
The component responsible for this crawling function is called the crawler.
The crawler is initially provided with a seed list of URLs to begin its exploration.
When the crawler discovers a new URL on a page, it adds that URL to its list of pages to examine in the future.
The primary objective of the crawler is to construct an index.
Indexes are constructed from lists of words, known as tokens, which are associated with specific web pages.
For each token it identifies, the crawler compiles a list of the URLs of the pages that contain that token.
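
The crawl-and-index loop described above can be sketched in Python. This is only an illustration under stated assumptions, not how any real search engine is implemented: the `TOY_WEB` dictionary stands in for the Web, its URLs are invented, and splitting lowercase text on whitespace is an assumed way of producing tokens.

```python
from collections import deque

# Hypothetical in-memory "Web": URL -> (page text, URLs linked from that page).
TOY_WEB = {
    "a.example/home": ("red fern guide", ["b.example/ferns", "c.example/red"]),
    "b.example/ferns": ("fern care tips", ["a.example/home"]),
    "c.example/red": ("red paint colors", []),
}

def crawl_and_index(seed_urls, web):
    """Visit pages reachable from the seed list and build token -> URL list."""
    to_visit = deque(seed_urls)   # pages still to examine
    seen = set(seed_urls)         # URLs already queued, so none is visited twice
    index = {}
    while to_visit:
        url = to_visit.popleft()
        text, links = web.get(url, ("", []))
        for token in text.lower().split():
            index.setdefault(token, set()).add(url)
        for link in links:
            if link not in seen:  # newly discovered URL: examine it later
                seen.add(link)
                to_visit.append(link)
    # Keep each token's URLs in alphabetical order.
    return {token: sorted(urls) for token, urls in index.items()}

index = crawl_and_index(["a.example/home"], TOY_WEB)
print(index["red"])
print(index["fern"])
```

Starting from a single seed URL, the sketch still reaches all three pages because each newly discovered link is added to the list of pages to examine.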

Query Processing

Users submit tokens, also referred to as search terms, to the query processor.
The search engine then searches its pre-built index for these words and returns a hit list.
The remarkable speed with which search engines can produce hit lists is a direct result of the index being created and maintained proactively.
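
To illustrate why a pre-built index makes query processing fast, here is a minimal sketch of a single-term lookup. The `INDEX` contents and the `lookup` helper are hypothetical; the point is only that answering the query is a lookup in an existing structure rather than a fresh scan of web pages.

```python
# Hypothetical pre-built index: token -> alphabetized list of URLs.
INDEX = {
    "fern": ["a.example/home", "b.example/ferns"],
    "red": ["a.example/home", "c.example/red"],
}

def lookup(token, index):
    """Return the hit list for one search term (empty if the token is not indexed)."""
    return index.get(token.lower(), [])

hits = lookup("Fern", INDEX)
print(hits)
```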

Multiple-Word Searches and Intersecting Queries

When a user submits a multiple-word query, the expectation is that every returned page will contain every word in the query.
This type of query is fundamentally an AND-query.
For multiple-word queries, the query processor retrieves individual index lists for each of the search terms.
The core task is then to intersect these lists, that is, to find the URLs that appear in every list.
The query processor performs this intersection and returns the common URLs.
To make the intersection fast, each token's list of URLs is kept in alphabetical order.
- Tokens can originate from various parts of a web page, such as the page's title.
- Index entries are generated exclusively for individual words, not phrases.
- Alphabetization significantly simplifies the process of identifying when the same URL appears in more than one list.

Rules for Intersecting Alphabetized Lists

To intersect alphabetized lists, follow these steps:

1. Place a marker or arrow at the beginning of each token's index list.
2. If all markers currently point to the identical URL, save that URL, because it is associated with every token in the query.
3. Find the alphabetically earliest URL among those currently pointed to by the markers, and move every marker pointing to it to the next position in its list.
4. Repeat steps 2 and 3 until at least one marker reaches the end of its list.
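
The marker procedure above can be sketched in Python as follows. The function name `intersect_sorted` and the sample URLs are assumptions made for illustration; the sketch expects every input list to already be in alphabetical order.

```python
def intersect_sorted(lists):
    """Intersect alphabetized URL lists using one marker per list."""
    if not lists:
        return []
    markers = [0] * len(lists)   # step 1: a marker at the start of every list
    common = []
    # Step 4: stop as soon as any marker runs off the end of its list.
    while all(m < len(lst) for m, lst in zip(markers, lists)):
        current = [lst[m] for m, lst in zip(markers, lists)]
        earliest = min(current)
        if all(url == earliest for url in current):
            common.append(earliest)   # step 2: every marker agrees, so save it
        # Step 3: advance every marker that points at the earliest URL.
        for i, url in enumerate(current):
            if url == earliest:
                markers[i] += 1
    return common

red = ["a.example/home", "c.example/red", "d.example/x"]
fern = ["a.example/home", "b.example/ferns", "d.example/x"]
print(intersect_sorted([red, fern]))
```

Because each pass moves at least one marker forward and no marker ever moves back, every list is scanned at most once, which is what alphabetization buys.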

Indexed Search's Power

A search engine performs the following key functions:

It dedicates the necessary computational time to crawl over web pages (data).
It utilizes the gathered information to construct a comprehensive index.
It efficiently locates the index entries for every word in a user's query.
It finds the required information for an AND-query by intersecting the relevant lists.
Remarkably, once an index is built, a search engine can return an answer to a query within approximately 0.25 seconds, referencing billions of web pages within its index.

Terminology for Items that Affect Relevancy

Not all tokens are equally important: a token that appears in one of the following descriptive places on a page counts more toward that page's relevancy.

Title (<title> tag): A short phrase found within these tags that summarizes the entire page's content.
Anchor text (<a> tag): The visible, clickable text within link tags, which provides a summary of the page it links to.
Meta tag (<meta> tag): Located within the <head> section of an HTML document, this tag often contains a summary description of the page's content.
Alt attributes (<img> tag): An attribute within an image tag that provides a textual summary or description of the image content.
<h1> tags: Top-level header text, often indicating the main topic of a section or the page itself.
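
As an illustration of where these descriptive tokens live in a page, the following sketch uses Python's standard `html.parser` module to pull text out of each of the places listed above. The class name, the `found` dictionary, and the sample page are hypothetical, and restricting `<meta>` to `name="description"` is an assumption; a real crawler would handle many more cases.

```python
from html.parser import HTMLParser

class DescriptiveTextParser(HTMLParser):
    """Collect text from the title, anchors, meta description, alt text, and h1."""

    TEXT_TAGS = {"title": "title", "a": "anchor", "h1": "h1"}

    def __init__(self):
        super().__init__()
        self.found = {"title": [], "anchor": [], "meta": [], "alt": [], "h1": []}
        self._open = []  # text-bearing tags that are currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.TEXT_TAGS:
            self._open.append(tag)
        elif tag == "meta" and attrs.get("name") == "description":
            self.found["meta"].append(attrs.get("content") or "")
        elif tag == "img" and attrs.get("alt"):
            self.found["alt"].append(attrs["alt"])

    def handle_endtag(self, tag):
        if tag in self._open:
            self._open.remove(tag)

    def handle_data(self, data):
        text = data.strip()
        if text:
            for tag in self._open:
                self.found[self.TEXT_TAGS[tag]].append(text)

SAMPLE = """<html><head><title>Fern Care</title>
<meta name="description" content="How to grow ferns indoors"></head>
<body><h1>Growing Ferns</h1>
<a href="https://example.com/soil">soil guide</a>
<img src="fern.png" alt="a potted fern"></body></html>"""

parser = DescriptiveTextParser()
parser.feed(SAMPLE)
parser.close()
found = parser.found
print(found)
```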

PageRank or Relevancy

The PageRank algorithm is the primary determinant of the order in which hits are returned by a search query.
A higher PageRank directly translates to a placement closer to the top of the search results list.
This mechanism is why the page a user is typically searching for often appears within the top 10 results on a search engine.

Voting Through Links for PageRank

Google revolutionized search ranking by pioneering a system of page ranking based on a