Overview of the Web and Its Technologies

Overview of the Web and Its Technologies

The Internet is the global network of computers; the Web is a hypertext system built on top of it. Hypertext enables linking documents through URLs, turning the web into a "network of documents." Modern web pages convey far more than plain text; URLs can embed parameters that pass additional data to servers.

Definition:

  • URL (Uniform Resource Locator) – the address that identifies a resource on the web and may contain query strings that convey extra information.

Markup Languages

HTML and Markdown

  • HTML (HyperText Markup Language) – the foundational language for structuring web documents.

  • Markdown – a lightweight markup language that abstracts HTML; ideal for beginners and for rapid content creation.

  • Markdown files are often rendered to HTML for display in browsers.

Milestones in Web History

Timeline of Key Events

Year

Milestone

Significance

1991

First web page at CERN (by Tim Berners‑Lee)

Introduced hypertext publishing

Mid‑1990s

First e‑commerce site (online pizza ordering)

Demonstrated commercial potential of the web

Early 2000s

Rise of social networking sites (e.g., MySpace)

Shift from static content to user‑generated content

Search Engine Evolution

Google’s Approach

Google's advantage stemmed from treating the web as a graph and ranking pages by link structure rather than just keyword frequency.

PageRank Algorithm

  • Definition: PageRank – a probability distribution used to rank web pages based on the likelihood that a random surfer lands on a given page.

  • Core Formula:
    PR(i) = rac{1 - d}{N} + d imes extstyle rac{ extstyle
    ange_{j ext{ in } M(i)} PR(j)}{ ext{number of outbound links from } j}

  • The PageRank of page $i$ is computed where:

    • $PR(i)$ is the PageRank of page $i$.

    • $d$ is the damping factor (commonly 0.85).

    • $N$ is the total number of pages in the web graph.

    • $M(i)$ is the set of pages linking to page $i$.

Key Historical Events in Search Engine Development

Search Engine

Launch Year

Core Approach

Yahoo!

1994

Directory‑based indexing

AltaVista

1995

Full‑text indexing, early crawling

Google

1998

PageRank‑driven link analysis

Mobile Web & Constant Connectivity

  • Early web usage was limited to academic and corporate labs.

  • The proliferation of smartphones (iPhone, Android devices) created a "pocket internet," making web access ubiquitous.

  • Mobile browsers and responsive design have become essential for modern web development.

Recommended Lab Activities

  1. HTML Basics: Create a simple page with headings, paragraphs, and links.

  2. Markdown Practice: Convert a short HTML snippet to Markdown and render it to HTML.

  3. URL Dissection: Identify protocol, domain, path, and query parameters in sample URLs.

Key Takeaways

  • The web evolved from a simple hypertext system to a complex platform supporting e-commerce, social media, and mobile access.

  • Markup languages (HTML, Markdown) are the building blocks for presenting content.

  • PageRank introduced a mathematically grounded way to rank pages, paving the way for Google’s dominance.

  • Objective ranking mechanisms are essential because manual ranking of billions of pages is infeasible.

Web Scale & Growth

  • The web contains approximately 1 billion pages (as of 2020) and continues to grow exponentially each year.

  • Traditional search techniques struggle with such size, prompting the need for link‑based algorithms.

Hypertext & Links

  • Definition: Hypertext is text that contains links to other texts, forming a network of directed connections.

  • A hyperlink acts as a vote from the source page to the target page.

  • The collection of all hyperlinks creates a directed graph where:

    • Nodes = web pages

    • Edges = hyperlinks (votes)

PageRank Fundamentals

  • PageRank quantifies the importance of a page based on the quantity and quality of inbound links.

  • Core Formula: PR(A) = rac{1-d}{N} + d imes ange{i=1}^{k} rac{PR(Ti)}{C(T_i)}

    • Symbol definitions:

    • $PR(A)$ is the PageRank of page $A$.

    • $d$ is the damping factor (commonly 0.85).

    • $T_i$ are pages that link to $A$.

    • $C(Ti)$ is the number of outbound links from page $Ti$.

    • $N$ is the total number of pages in the web graph.

Damping Factor and Its Role

  • The damping factor models the probability that a random surfer continues clicking links versus jumping to a random page.

Computation Steps for PageRank

  1. Initialize every page’s rank to a uniform value.

  2. Iterate over all pages, applying the PageRank formula.

  3. Normalize ranks so that the sum of all values equals 1.

  4. Repeat steps 2-3 until changes fall below a small threshold (convergence).

Recursive Importance Propagation

  • PageRank is recursive: a page’s rank depends on the ranks of pages linking to it, which in turn depend on others.

  • Higher-ranked pages passing a link confer greater weight (“vote value”) to the target.

  • This creates a feedback loop where influential clusters amplify each other’s importance.

Example Illustration

  • Example: Page X with 6 inbound links (each from a distinct page) receives a larger rank than a page with only 1 inbound link.

  • If one of those linking pages later gains additional inbound links, its vote weight increases, further boosting X’s rank in subsequent iterations.

Relevance vs. Subjectivity

  • Relevance: How well a page satisfies a specific query; a core component of perceived importance.

  • Subjectivity: Individual user preferences (e.g., "this page answers my personal question").

  • PageRank aims to reduce subjectivity by relying solely on network structure (link counts and link quality) rather than content semantics.

Algorithm Complexity

Aspect

Complexity

1

$O(N)$

2

$O(N)$

3

$O(E)$

4

Typically 20–50 iterations for large web graphs (empirical)

Key Takeaways on PageRank

  • The web’s massive, growing size necessitates link‑based ranking.

  • Hyperlinks act as votes, and their aggregation through a recursive algorithm yields an objective importance metric.

  • PageRank balances quantity (number of links) and quality (rank of linking pages) to produce a stable ranking useful for search engines.

Shortest Paths in a Network

  • Definition: The shortest path is the minimal-length route between two nodes.

  • Calculating shortest paths for every pair of nodes reveals which nodes lie on many of these routes.

  • Example questions:

    • What is the shortest path between Utah and Carnegie?

    • What is the shortest path between Grand and MIT?

Betweenness Centrality

  • A node’s betweenness centrality measures how often it appears on the shortest paths between other node pairs.

  • Nodes with high betweenness act as “bridges”; removing them can split the network into disconnected components.

  • In the described network, Utah has the highest betweenness because it lies on the majority of the shortest paths.

Time and Space Complexity

  • Time per iteration is $O(E)$, where $E$ is the number of hyperlinks.

  • Space required for storing rank values is $O(N)$, where $N$ is the total number of nodes.

  • Convergence typically requires 20–50 iterations for large web graphs (empirical).

Principle of Betweenness Centrality

  • Betweenness centrality indicates a node’s importance as an intermediary; high values imply that the node is critical for network connectivity.

Comparing Centrality Measures

  • Real‑World Analogies:

    • Road-trip analogy: A direct route from Utah to MIT is analogous to the shortest network path. Taking a long detour represents a non-optimal path.

    • Hollywood network analogy: Samuel Jackson is highlighted as the central figure in the Hollywood graph due to many connections and a low “Bacon number.”

Key Takeaways on Centrality Measures

  • Shortest-path analysis helps identify critical nodes.

  • Betweenness centrality is a primary metric for spotting “bridge” nodes whose loss would fragment the network.

  • High degree and high betweenness both signal importance, but they reflect different aspects of connectivity.

Practical Steps to Compute Betweenness Centrality

  1. List all node pairs (s, t) in the network.

  2. Find all shortest paths between each pair.

  3. Count how many of those shortest paths pass through the target node $v$.

  4. Sum the counts over all pairs to obtain $v$’s betweenness score.

  5. These steps illustrate the process without requiring additional code or formulas.