Advanced Search Techniques and Logical Operators

Advanced Searches

  • Leveraging the Index: Query processors use the index (which tells which pages contain each word) to offer advanced search options, maximizing the efficiency of the "crawl-then-query" model.

  • Google's Advanced Search UI: Google provides a dedicated Advanced Search interface (referred to as Figure 5.4 in the text) with distinct input fields for:

    • Words to be included (AND words).

    • Exact phrases (quoted words or phrases).

    • Words of which at least one must be included (OR words).

    • Words to be excluded (NOT words).

    • In the screenshot, the operator names themselves are typed into these fields as search terms to show which field corresponds to which operator.
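The index mentioned above can be pictured as a mapping from each word to the pages that contain it. The following is a minimal sketch under stated assumptions: the page names and texts are invented for illustration, and a real crawler and index are far more elaborate.

```python
from collections import defaultdict

# Hypothetical pages, invented for illustration only.
PAGES = {
    "page1": "human powered flight history",
    "page2": "flight schedules and airports",
    "page3": "human anatomy basics",
}

def build_index(pages):
    """Map each word to the set of page ids whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return dict(index)

index = build_index(PAGES)
print(sorted(index["flight"]))  # the pages associated with "flight"
```

Because the index is built once during the crawl, each query only needs to look up and combine a few of these lists.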

Logical Operators (Boolean Operators)

Logical operators, also known as Boolean operators (named after George Boole), define the logical relationship between search terms.

The Logical Operator AND

  • Definition: An AND-query requires that all specified words must be associated with a page for it to be considered a "hit."

  • Basic Queries: Most basic queries, such as human-powered flight typed without quotes, are implicitly AND-queries. The search engine treats each word as an independent term and splits hyphenated words, so "human-powered" becomes "human" and "powered."

  • Example: human AND powered AND flight

    • This means that the logical tests "Is human associated with the page?" AND "Is powered associated with the page?" AND "Is flight associated with the page?" must all be true.

  • Mechanism: This aligns with how query processors work, using the intersection of index lists. Each word has an index list of pages where it appears, and an AND-query finds pages common to all specified lists.

  • User Interface: In Google's Advanced Search UI (Figure 5.4), AND-queries are specified using the "all these words" line.

  • Result Count: AND-queries generally produce fewer search results than OR-queries because they impose stricter conditions.
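The intersection of index lists described above can be sketched as follows. This is an illustrative sketch, not the code of any real query processor: the index contents are hypothetical, and each list is assumed to be sorted alphabetically.

```python
def intersect_sorted(a, b):
    """Return the items common to two sorted lists, in sorted order."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear later in b, so skip it
        else:
            j += 1  # b[j] cannot appear later in a, so skip it
    return common

def and_query(index, words):
    """Pages associated with every word in `words`."""
    lists = [index.get(word, []) for word in words]
    if not lists:
        return []
    result = lists[0]
    for page_list in lists[1:]:
        result = intersect_sorted(result, page_list)
    return result

# Hypothetical index lists (sorted), invented for illustration.
INDEX = {
    "human": ["a.html", "b.html", "d.html"],
    "powered": ["a.html", "c.html", "d.html"],
    "flight": ["a.html", "d.html", "e.html"],
}

hits = and_query(INDEX, ["human", "powered", "flight"])
print(hits)  # ['a.html', 'd.html']
```

Each added word can only shrink the result, which is why AND-queries tend to return fewer hits.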

The Logical Operator OR

  • Definition: An OR-query hits on pages that are associated with at least one of the specified words.

  • Example: marshmallow OR strawberry OR chocolate

    • This means that at least one of the logical tests ("Is marshmallow associated with the page?" OR "Is strawberry associated with the page?" OR "Is chocolate associated with the page?") must be true.

  • User Interface: OR-queries can be entered into the third text window of the Advanced Search page (Figure 5.4).

  • Result Count: OR-queries typically yield more results than AND-queries, as they allow for a broader range of matches (e.g., pages with only marshmallow, only strawberry, only chocolate, or any combination of them).

  • Handling Duplicates in OR-Queries: When combining index lists for OR-queries (e.g., lisa OR bart OR homer), a page associated with multiple terms (like 'bart' and 'homer') appears on more than one list and would otherwise be reported multiple times. The query processor must remove these duplicates while merging the lists, a process similar to the one described in "Intersecting Alphabetized Lists."
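The duplicate removal described above can be sketched as a merge of sorted lists that skips repeated entries. This is a minimal sketch assuming each index list is already sorted; the page names are hypothetical.

```python
import heapq

def or_query(index, words):
    """Pages associated with at least one word, each listed once."""
    lists = [index.get(word, []) for word in words]
    result = []
    # heapq.merge yields the items of all sorted lists in sorted order,
    # so duplicates arrive next to each other and are easy to skip.
    for page in heapq.merge(*lists):
        if not result or result[-1] != page:
            result.append(page)
    return result

# Hypothetical index lists (sorted), invented for illustration.
INDEX = {
    "lisa": ["p1", "p3"],
    "bart": ["p2", "p3", "p5"],
    "homer": ["p2", "p4"],
}

hits = or_query(INDEX, ["lisa", "bart", "homer"])
print(hits)  # ['p1', 'p2', 'p3', 'p4', 'p5']
```

Pages p2 and p3 appear on two lists each but are reported only once.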

The Logical Operator NOT

  • Definition: A NOT-query specifies words that must not be associated with the page for it to be a hit.

  • Traditional Operator: The NOT keyword is typically used in advanced searches.

  • Example: tigers NOT baseball would find pages about the animal, excluding those about the Detroit baseball team.

  • Google's Syntax: Google uses the minus character (-) as an abbreviation for NOT. The minus sign should be placed immediately before the word to be excluded, with no space between them.

  • Google Example: tigers -baseball achieves the same result as tigers NOT baseball.

  • User Interface: In Google's Advanced Search UI (Figure 5.4), there's a specific line for specifying words that are "not to be associated with the page."
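In terms of index lists, a NOT-query can be pictured as removing one word's pages from another word's pages. The sketch below uses Python sets and hypothetical page names; it illustrates the idea and makes no claim about how Google implements exclusion.

```python
def not_query(index, include, exclude):
    """Pages associated with `include` but not with `exclude`."""
    wanted = set(index.get(include, ()))
    unwanted = set(index.get(exclude, ()))
    return sorted(wanted - unwanted)

# Hypothetical index lists, invented for illustration.
INDEX = {
    "tigers": ["detroit.html", "wildlife.html", "zoo.html"],
    "baseball": ["detroit.html", "mlb.html"],
}

hits = not_query(INDEX, "tigers", "baseball")
print(hits)  # ['wildlife.html', 'zoo.html']
```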

Quotes in Search Queries

  • Purpose: Using quotes around a phrase (e.g., "human-powered flight") restricts results to pages where the phrase appears exactly as given.

  • When to Use: Quotes are essential for:

    • Exact quotations.

    • Book or movie titles.

    • Clichés.

    • Phrases you are certain will appear verbatim on the desired page (e.g., "all your base" for the web meme "All your base are belong to us."). It's not always necessary to quote the entire phrase or title, just the part you are sure will appear.

  • When to Avoid/Be Cautious: Treating query terms independently (without quotes) is often preferred to ensure you don't miss useful results.

    • Example: A search for "human-powered flight" might miss a page with the caption "First pedal-powered flight achieving the human dream of flying" because the exact phrase isn't present, even though all individual words are.

    • Treating words independently (even hyphenated ones) usually yields the most relevant hits, because the order and proximity of the words can vary across useful sources.
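The difference between a quoted phrase and independent terms can be checked on the caption from the example above. This is a simplified sketch: the tokenizer (lowercase, split on anything that is not a letter or digit) is an assumption made for illustration, not the rule any particular search engine uses.

```python
import re

def words(text):
    """Lowercase the text and split it into words; hyphens separate words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches_all_words(query, text):
    """Unquoted query: every query word appears somewhere in the text."""
    page_words = set(words(text))
    return all(word in page_words for word in words(query))

def matches_phrase(phrase, text):
    """Quoted query: the words appear consecutively and in order."""
    target = words(phrase)
    page = words(text)
    n = len(target)
    return any(page[i:i + n] == target for i in range(len(page) - n + 1))

caption = "First pedal-powered flight achieving the human dream of flying"

print(matches_all_words("human-powered flight", caption))  # True
print(matches_phrase("human-powered flight", caption))     # False
```

The caption contains all three words, so the unquoted query hits it, but the words never appear together in that order, so the quoted query misses it.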

Combining Logical Operators

  • Arithmetic Analogy: Logical operators function similarly to arithmetic operators, allowing for combination and grouping using parentheses.

  • Clarity with Parentheses: Parentheses are crucial for defining the order of operations and preventing ambiguity.

    • Example for Clarity: (marshmallow OR strawberry OR chocolate) AND sundae

      • This requires that the page be associated with the word sundae and with at least one of the specified flavors. The parentheses clearly group the OR-terms.

  • Ambiguity Without Parentheses: Without parentheses, a query can have multiple meanings.

    • Ambiguous Example: marshmallow OR strawberry OR chocolate AND sundae

    • Possible Meaning 1 (intended): (marshmallow OR strawberry OR chocolate) AND sundae

    • Possible Meaning 2 (alternative): marshmallow OR strawberry OR (chocolate AND sundae)

      • This alternative meaning would allow hits on pages with only marshmallow or only strawberry, since sundae would only be required for chocolate.

  • Google's Flexibility:

    • Implicit AND: Google often interprets a space between words as an implicit AND operator.

    • Grouped OR-words: If OR-words are grouped, they don't always need parentheses in Google searches, assuming the intention is one of the OR-choices combined with the rest of the query. For instance, simpson bart OR lisa OR maggie -homer -marge will correctly search for pages about the Simpson children, excluding Homer and Marge.

  • Recommendation: It's generally wise to use the formal query structure with uppercase logical operators and parentheses for clarity, because search software differs in how it handles implicit operators and parentheses.

  • Minus for NOT: Remember Google's use of the minus sign (-) directly preceding a word for NOT (e.g., simpson bart OR lisa OR maggie -homer -marge).
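The grouping rules in this section can be made concrete with a small sketch. The functions `parse_query` and `matches` are hypothetical helpers written for this illustration: they assume that a space means AND, that words joined by an uppercase OR form one group, and that a leading minus means NOT. This follows the behavior described above but is not Google's actual parser. `meaning_2` spells out the alternative reading of the ambiguous sundae query for contrast.

```python
def parse_query(query):
    """Split a query into required OR-groups and excluded words."""
    tokens = query.split()
    required = []  # each entry is a list of alternatives (an OR-group)
    excluded = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.startswith("-") and len(token) > 1:
            excluded.append(token[1:].lower())
            i += 1
            continue
        group = [token.lower()]
        i += 1
        # Absorb "OR word" pairs into the current group.
        while i + 1 < len(tokens) and tokens[i] == "OR":
            group.append(tokens[i + 1].lower())
            i += 2
        required.append(group)
    return required, excluded

def matches(page_words, required, excluded):
    """True if every group has a word on the page and no excluded word is."""
    has_all_groups = all(any(w in page_words for w in group) for group in required)
    has_excluded = any(w in page_words for w in excluded)
    return has_all_groups and not has_excluded

def meaning_2(page_words):
    """marshmallow OR strawberry OR (chocolate AND sundae)"""
    return (
        "marshmallow" in page_words
        or "strawberry" in page_words
        or ("chocolate" in page_words and "sundae" in page_words)
    )

simpsons = parse_query("simpson bart OR lisa OR maggie -homer -marge")
print(simpsons)

# Meaning 1: the flavors form one OR-group, and sundae is also required.
sundae = parse_query("marshmallow OR strawberry OR chocolate sundae")
page = {"strawberry", "shortcake", "recipe"}  # hypothetical page, no "sundae"
print(matches(page, *sundae), meaning_2(page))  # False True
```

The same page is a miss under Meaning 1 and a hit under Meaning 2, which is why explicit parentheses are the safer choice when the software's grouping rules are unknown.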

Restricting and Focusing Searches

Search tools allow for greater precision by adding constraints to global searches.

Site Search

  • Definition: Many websites offer a "site search" function, allowing users to search only within that specific site's content.

  • Usage: Useful when you know the information is on a particular site and want to quickly find it without navigating manually.

Focused Searches

  • Challenge: General web searches often yield too many results, making it difficult to find relevant information.

  • Solution: Adding specific constraints can help direct the search more effectively.

Filtered Searches

  • Mechanism: Filtering options, often found at the bottom of Advanced Search UIs, allow users to narrow down results based on specific criteria.

  • Example Scenario (Manifest Arts Festival):

    1. Initial broad search: art manifest yielded many irrelevant results.

    2. Adding a general constraint: Adding "school" (art manifest school) refined the results somewhat, but the desired page was still not in the top 10.

    3. Using a filter (domain restriction): Restricting the search for manifest to the .edu (educational) domain placed the exact desired page at the top of the search results.

  • Effectiveness: Filtered searches can pinpoint specific information among a vast number of initial hits, which shows how much a single additional constraint can help.
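A domain filter like the one in the scenario above can be sketched as a check on each hit's host name. The URLs below are hypothetical placeholders, not the actual pages from the example, and a real search engine applies the restriction inside its index rather than after the fact.

```python
from urllib.parse import urlparse

def restrict_to_domain(urls, suffix):
    """Keep only the URLs whose host name ends with `suffix` (e.g. '.edu')."""
    kept = []
    for url in urls:
        host = urlparse(url).hostname or ""
        if host.endswith(suffix):
            kept.append(url)
    return kept

# Hypothetical hit list, invented for illustration.
HITS = [
    "https://www.example.com/art-manifest",
    "https://arts.example.edu/manifest",
    "https://manifest.example.org/school",
    "https://www.example.edu/events/manifest-festival",
]

edu_hits = restrict_to_domain(HITS, ".edu")
print(edu_hits)
```

The filter keeps the original order of the hits and simply drops those outside the requested domain.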

Page Ranking Local Searches

  • Challenge with Local Site Searches: Large websites (e.g., NPR, IRS, CNN) have their own local search engines. These can be effective for rare terms, but for common terms they tend to produce large hit lists that are unordered or ordered unhelpfully (e.g., chronologically).

  • Google's Advantage for Local Searches: By performing the same search in Google but restricting it to a specific domain (e.g., "search terms" site:npr.org), one can leverage Google's sophisticated page ranking algorithms.

    • Benefit: This method helps order the hits logically, making the search more effective than the site's internal local search, especially when the hit list is extensive.

    • Mechanism: It harnesses the power of Google's global indexing combined with its intelligent page ranking system for site-specific content. This can convert an unordered or less useful list into a highly relevant, ranked list.
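Building a site-restricted query of the form shown above is a matter of appending the site: restriction to the search terms. The helper name `site_query` is hypothetical, and the encoding step only shows how such a query would be made safe for use as a URL parameter; no particular search URL is assumed.

```python
from urllib.parse import quote_plus, unquote_plus

def site_query(terms, domain):
    """Append a site: restriction to the search terms."""
    return f"{terms} site:{domain}"

query = site_query('"human-powered flight"', "npr.org")
encoded = quote_plus(query)  # form suitable for a URL query parameter

print(query)
print(encoded)
```

The quoted phrase and the domain restriction combine in one query, so the exact-phrase and ranking behavior apply only to pages from that site.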