Overview of the Web and Its Technologies
Overview of the Web and Its Technologies
The Internet is the global network of computers; the Web is a hypertext system built on top of it. Hypertext enables linking documents through URLs, turning the web into a "network of documents." Modern web pages convey far more than plain text; URLs can embed parameters that pass additional data to servers.
Definition:
URL (Uniform Resource Locator) – the address that identifies a resource on the web and may contain query strings that convey extra information.
Markup Languages
HTML and Markdown
HTML (HyperText Markup Language) – the foundational language for structuring web documents.
Markdown – a lightweight markup language that abstracts HTML; ideal for beginners and for rapid content creation.
Markdown files are often rendered to HTML for display in browsers.
Milestones in Web History
Timeline of Key Events
Year | Milestone | Significance |
|---|---|---|
1991 | First web page at CERN (by Tim Berners‑Lee) | Introduced hypertext publishing |
Mid‑1990s | First e‑commerce site (online pizza ordering) | Demonstrated commercial potential of the web |
Early 2000s | Rise of social networking sites (e.g., MySpace) | Shift from static content to user‑generated content |
Search Engine Evolution
Google’s Approach
Google's advantage stemmed from treating the web as a graph and ranking pages by link structure rather than just keyword frequency.
PageRank Algorithm
Definition: PageRank – a probability distribution used to rank web pages based on the likelihood that a random surfer lands on a given page.
Core Formula:
PR(i) = rac{1 - d}{N} + d imes extstyle rac{ extstyle
ange_{j ext{ in } M(i)} PR(j)}{ ext{number of outbound links from } j}The PageRank of page $i$ is computed where:
$PR(i)$ is the PageRank of page $i$.
$d$ is the damping factor (commonly 0.85).
$N$ is the total number of pages in the web graph.
$M(i)$ is the set of pages linking to page $i$.
Key Historical Events in Search Engine Development
Search Engine | Launch Year | Core Approach |
|---|---|---|
Yahoo! | 1994 | Directory‑based indexing |
AltaVista | 1995 | Full‑text indexing, early crawling |
1998 | PageRank‑driven link analysis |
Mobile Web & Constant Connectivity
Early web usage was limited to academic and corporate labs.
The proliferation of smartphones (iPhone, Android devices) created a "pocket internet," making web access ubiquitous.
Mobile browsers and responsive design have become essential for modern web development.
Recommended Lab Activities
HTML Basics: Create a simple page with headings, paragraphs, and links.
Markdown Practice: Convert a short HTML snippet to Markdown and render it to HTML.
URL Dissection: Identify protocol, domain, path, and query parameters in sample URLs.
Key Takeaways
The web evolved from a simple hypertext system to a complex platform supporting e-commerce, social media, and mobile access.
Markup languages (HTML, Markdown) are the building blocks for presenting content.
PageRank introduced a mathematically grounded way to rank pages, paving the way for Google’s dominance.
Objective ranking mechanisms are essential because manual ranking of billions of pages is infeasible.
Web Scale & Growth
The web contains approximately 1 billion pages (as of 2020) and continues to grow exponentially each year.
Traditional search techniques struggle with such size, prompting the need for link‑based algorithms.
Hypertext & Links
Definition: Hypertext is text that contains links to other texts, forming a network of directed connections.
A hyperlink acts as a vote from the source page to the target page.
The collection of all hyperlinks creates a directed graph where:
Nodes = web pages
Edges = hyperlinks (votes)
PageRank Fundamentals
PageRank quantifies the importance of a page based on the quantity and quality of inbound links.
Core Formula: PR(A) = rac{1-d}{N} + d imes ange{i=1}^{k} rac{PR(Ti)}{C(T_i)}
Symbol definitions:
$PR(A)$ is the PageRank of page $A$.
$d$ is the damping factor (commonly 0.85).
$T_i$ are pages that link to $A$.
$C(Ti)$ is the number of outbound links from page $Ti$.
$N$ is the total number of pages in the web graph.
Damping Factor and Its Role
The damping factor models the probability that a random surfer continues clicking links versus jumping to a random page.
Computation Steps for PageRank
Initialize every page’s rank to a uniform value.
Iterate over all pages, applying the PageRank formula.
Normalize ranks so that the sum of all values equals 1.
Repeat steps 2-3 until changes fall below a small threshold (convergence).
Recursive Importance Propagation
PageRank is recursive: a page’s rank depends on the ranks of pages linking to it, which in turn depend on others.
Higher-ranked pages passing a link confer greater weight (“vote value”) to the target.
This creates a feedback loop where influential clusters amplify each other’s importance.
Example Illustration
Example: Page X with 6 inbound links (each from a distinct page) receives a larger rank than a page with only 1 inbound link.
If one of those linking pages later gains additional inbound links, its vote weight increases, further boosting X’s rank in subsequent iterations.
Relevance vs. Subjectivity
Relevance: How well a page satisfies a specific query; a core component of perceived importance.
Subjectivity: Individual user preferences (e.g., "this page answers my personal question").
PageRank aims to reduce subjectivity by relying solely on network structure (link counts and link quality) rather than content semantics.
Algorithm Complexity
Aspect | Complexity |
|---|---|
1 | $O(N)$ |
2 | $O(N)$ |
3 | $O(E)$ |
4 | Typically 20–50 iterations for large web graphs (empirical) |
Key Takeaways on PageRank
The web’s massive, growing size necessitates link‑based ranking.
Hyperlinks act as votes, and their aggregation through a recursive algorithm yields an objective importance metric.
PageRank balances quantity (number of links) and quality (rank of linking pages) to produce a stable ranking useful for search engines.
Shortest Paths in a Network
Definition: The shortest path is the minimal-length route between two nodes.
Calculating shortest paths for every pair of nodes reveals which nodes lie on many of these routes.
Example questions:
What is the shortest path between Utah and Carnegie?
What is the shortest path between Grand and MIT?
Betweenness Centrality
A node’s betweenness centrality measures how often it appears on the shortest paths between other node pairs.
Nodes with high betweenness act as “bridges”; removing them can split the network into disconnected components.
In the described network, Utah has the highest betweenness because it lies on the majority of the shortest paths.
Time and Space Complexity
Time per iteration is $O(E)$, where $E$ is the number of hyperlinks.
Space required for storing rank values is $O(N)$, where $N$ is the total number of nodes.
Convergence typically requires 20–50 iterations for large web graphs (empirical).
Principle of Betweenness Centrality
Betweenness centrality indicates a node’s importance as an intermediary; high values imply that the node is critical for network connectivity.
Comparing Centrality Measures
Real‑World Analogies:
Road-trip analogy: A direct route from Utah to MIT is analogous to the shortest network path. Taking a long detour represents a non-optimal path.
Hollywood network analogy: Samuel Jackson is highlighted as the central figure in the Hollywood graph due to many connections and a low “Bacon number.”
Key Takeaways on Centrality Measures
Shortest-path analysis helps identify critical nodes.
Betweenness centrality is a primary metric for spotting “bridge” nodes whose loss would fragment the network.
High degree and high betweenness both signal importance, but they reflect different aspects of connectivity.
Practical Steps to Compute Betweenness Centrality
List all node pairs (s, t) in the network.
Find all shortest paths between each pair.
Count how many of those shortest paths pass through the target node $v$.
Sum the counts over all pairs to obtain $v$’s betweenness score.
These steps illustrate the process without requiring additional code or formulas.