2.1.4 - Data Sources
Overview
Today's data comes from many different places, changes quickly, and behaves in varied ways.
Common data sources:
Relational Databases
Flat files and XML Datasets
APIs and Web Services
Web Scraping
Data Streams
Feeds
This builds on the previous lesson's theme of the sheer volume and variety of modern data.
Knowing how each data source is set up, how often it updates, and how to get to it is key for picking the right ways to store, combine, and analyze data.
Internal Relational Databases
These are the primary storage for most company systems (like for managing day-to-day activities…customer relationships, human resources, or daily transactions/workflow).
Common relational database management systems are SQL Server, Oracle, MySQL, and IBM DB2; all store data in a structured way.
How they work:
Data is put into connected tables with clear rules, links between tables, and careful handling of transactions (ACID guarantees).
They are optimized for OLTP (online transaction processing: many quick, small reads and writes).
How they help with analysis:
Sales data from stores can be used to see sales trends in different regions.
Customer data can help predict future demand and sales.
What this means for practical use: Data is often moved from these databases to larger data storage areas (data warehouses/lakes).
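To make the sales-trend example concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for an operational database; the table name and columns are hypothetical, and a real system would query SQL Server, Oracle, etc.

```python
import sqlite3

# In-memory database standing in for an operational store;
# the sales table and its columns are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 120.0), ("East", 80.0), ("West", 200.0)],
)

# Aggregate sales by region -- the kind of query whose results
# are later moved into a warehouse or dashboard.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('East', 200.0), ('West', 200.0)]
```

The GROUP BY aggregation is exactly the shape of query that ETL jobs run before loading summarized data into a warehouse.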
External Datasets
These are data sources from outside a company, either public (like government websites) or commercial (sold by data companies).
Examples:
Government census and demographic statistics, macroeconomic indicators.
Historical weather data, satellite imagery.
Purchased retail scan data, financial market information.
Business value:
Supports strategic planning, demand forecasting, logistics optimization, and targeted marketing.
These usually arrive as plain text files, spreadsheets, or XML; usage licenses and privacy rules must be followed.
Flat Files & Spreadsheets
What a flat file is: A plain text file where each line is one piece of information, and different pieces are separated by a character like a comma or tab.
It represents a single simple table (no relationships between tables).
CSV (Comma-Separated Values):
Very common, human-readable, and easy to split or stream in parts.
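A short sketch of reading a CSV flat file with Python's standard csv module; the file contents and column names here are made-up examples held in memory rather than on disk.

```python
import csv
import io

# A small in-memory CSV standing in for a flat file on disk;
# the column names (id, name, city) are hypothetical.
raw = "id,name,city\n1,Ada,London\n2,Grace,New York\n"

# DictReader maps each line to a dict keyed by the header row.
reader = csv.DictReader(io.StringIO(raw))
rows = list(reader)
print(rows[0]["name"])  # Ada
```

Because a flat file is one table, each parsed row is self-contained; any relationships to other data must be reconstructed downstream.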
Spreadsheets = “better flat files”:
Can have many sheets, each like a different table.
Can include formatting, formulas, and charts – non-data elements that often must be carefully stripped before analysis.
Common types and tools: XLS/XLSX (Excel), Google Sheets, Apple Numbers, LibreOffice Calc.
Importance: The first step for many data analysis tasks; simple but can cause problems with many different versions (“spreadsheet mess”).
XML Datasets
XML = eXtensible Markup Language; data is wrapped in custom tags, making it self-describing.
Can handle complex, nested structures, like orders with many items, or survey questions with sub-questions.
Common uses:
Data exported from online surveys.
Digital bank statements.
Standard ways to exchange information in industries (e.g., HL7 for healthcare, FINXML for finance).
Reading XML requires dedicated tools (XPath/XQuery, or DOM/SAX parsers); it is often converted to JSON or simpler tabular forms for analysis.
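As an illustration of the nested order-with-items structure mentioned above, here is a sketch using Python's standard xml.etree.ElementTree parser with an XPath-style query; the tag and attribute names are hypothetical.

```python
import xml.etree.ElementTree as ET

# A nested order document -- the kind of structure XML handles well.
# The <order>/<item> tags and attributes are invented for this example.
doc = """
<order id="42">
  <item sku="A1" qty="2"/>
  <item sku="B7" qty="1"/>
</order>
"""

root = ET.fromstring(doc)
# The ".//item" XPath expression finds every <item>, at any depth.
skus = [item.get("sku") for item in root.findall(".//item")]
print(skus)  # ['A1', 'B7']
```

Flattening the matched elements into a list like this is the usual first step before converting XML into a table or JSON.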
APIs & Web Services
These are programs that listen for requests from other programs over the internet and send back information that computers can read (XML, JSON, CSV, HTML, media).
How data is retrieved:
Your program sends a request with parameters and authentication credentials.
The service sends back the information plus a status code showing whether the request worked (e.g., 200 OK, 404 Not Found).
Examples of APIs:
Social media: Twitter & Facebook Graph APIs → get tweets/posts to understand feelings or opinions.
Markets: Stock market APIs → get stock prices, earnings, past price movements.
Data quality: Services to map a ZIP code to a City/State, improve addresses, or remove duplicate contacts.
Company integrations: internal microservices or database exposure layers (OData, REST-to-SQL gateways).
Rules to follow: respect how many requests you can make, user privacy, and terms of service.
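The request/response cycle above can be sketched as follows. To keep the example self-contained, a canned status code and JSON body stand in for what an HTTP client (such as urllib or requests) would return from a live API; the payload fields are hypothetical.

```python
import json

# Canned values standing in for a real HTTP response;
# the "symbol"/"price" fields are invented for this sketch.
status_code = 200
body = '{"symbol": "XYZ", "price": 101.5}'

if status_code == 200:       # 200 OK -> the request succeeded
    data = json.loads(body)  # parse the machine-readable JSON payload
    print(data["price"])     # 101.5
else:
    # Any other code (404 Not Found, 429 rate-limited, ...) is an error path.
    raise RuntimeError(f"API call failed with status {status_code}")
```

Checking the status code before parsing is the key habit: rate limits and auth failures arrive as non-200 codes, not as data.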
Web Scraping
Also called screen scraping, web harvesting, web data extraction.
Goal: Take specific pieces of information from web pages not set up for data extraction.
What it can do:
Collect text, contact details, pictures, videos, prices, reviews.
Programs can follow the rules in robots.txt (or sometimes ignore them, which can be an ethical problem).
Popular uses:
Checking competitor prices from online stores.
Finding new contacts from public lists.
Watching online forums for public opinion.
Making large sets of data for training machine learning (e.g., product descriptions, picture captions).
Tools & programs:
Beautiful Soup (Python) – good for messy HTML web pages.
Scrapy – a complete system for automatically gathering web data.
Pandas – quick way to extract tables from web pages using read_html().
Selenium – drives real web browsers for sites that rely heavily on JavaScript.
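A minimal scraping sketch using only Python's standard html.parser, extracting prices from an HTML snippet; real projects would typically reach for Beautiful Soup or Scrapy, and both the markup and the "price" class name here are invented.

```python
from html.parser import HTMLParser

# Hypothetical product-listing markup, inlined for the example.
page = '<ul><li class="price">$9.99</li><li class="price">$4.50</li></ul>'

class PriceScraper(HTMLParser):
    """Collect the text of every element with class="price"."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # Flag that the next text node belongs to a price element.
        self.in_price = dict(attrs).get("class") == "price"

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data)
            self.in_price = False

scraper = PriceScraper()
scraper.feed(page)
print(scraper.prices)  # ['$9.99', '$4.50']
```

This is the core of competitor price monitoring: find a stable structural marker in the page, then pull the text next to it.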
Data Streams & Feeds
What they are: Continuous, high-velocity, timestamped records emitted by many different sources.
Where they come from:
IoT sensors, factory machines, GPS devices, social media, website clicks.
Often include location details (latitude/longitude) + times → useful for location-time analysis.
Business situations:
Stock market data → for quick trading, understanding market changes.
Store sales data → for checking inventory in real-time.
Security camera video → for finding threats using computer vision.
Social media posts → for live views of public feelings.
Website clicks → for making websites better, testing different versions.
Airline event data → for automatic rebooking and flight changes.
Programs for handling them:
Apache Kafka (a system for handling many messages).
Apache Spark Streaming / Structured Streaming (for small groups of data at a time or continuous data).
Apache Storm (for fast processing of data streams).
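To show the flavor of stream processing without any of those frameworks, here is a tiny sketch: a Python generator stands in for a live stream of timestamped click counts (the numbers are synthetic), and a rolling sum over the last two events mimics the windowed aggregations that Kafka consumers or Spark Streaming jobs perform at scale.

```python
from collections import deque

def click_stream():
    # Synthetic (timestamp, clicks) events standing in for a live feed.
    for event in [(1, 5), (2, 3), (3, 9), (4, 1)]:
        yield event

# Sliding window over the last 2 events; deque(maxlen=2) drops the
# oldest value automatically as new events arrive.
window = deque(maxlen=2)
rolling = []
for t, clicks in click_stream():
    window.append(clicks)
    rolling.append(sum(window))

print(rolling)  # [5, 8, 12, 10]
```

The key difference from batch work is visible even here: results are produced incrementally as events arrive, not once at the end.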
RSS Feeds
RSS = Really Simple Syndication; a system based on XML for letting people know when new content is available.
Great for news sites, blogs, forums where content updates often.
How to use them:
A feed reader polls the RSS XML and presents new items in chronological order.
Helps analysts get headlines, article details, or full articles to find trends.
Compared to data streams: slower and pull-based (you must poll for updates), but still easy to automate.
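Since RSS is just XML, headlines can be pulled with the standard library; this sketch parses a minimal inline RSS 2.0 document (the titles are made up), whereas real feeds are fetched over HTTP and often parsed with the third-party feedparser library mentioned later.

```python
import xml.etree.ElementTree as ET

# A minimal RSS 2.0 document with invented titles, inlined so the
# example needs no network access.
feed = """
<rss version="2.0"><channel>
  <title>Example News</title>
  <item><title>First headline</title></item>
  <item><title>Second headline</title></item>
</channel></rss>
"""

root = ET.fromstring(feed)
# Each <item> is one piece of content; its <title> is the headline.
headlines = [item.findtext("title") for item in root.findall(".//item")]
print(headlines)  # ['First headline', 'Second headline']
```

Polling this parse on a schedule and diffing against previously seen items is all a basic feed reader does.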
Ethical, Philosophical, & Practical Implications
Who owns the data & permission: taking data from websites versus using APIs must follow copyright rules and terms of service.
Bias & representation: social media APIs might only show views from people who post a lot.
Security: sharing internal databases through APIs requires user checks (like OAuth, API keys) and secure connections (TLS).
Sustainability: constantly checking feeds/streams uses a lot of computing power and energy.
Management: changes to data setup, tracking data's path, and managing different versions are especially hard for data that is partly structured (XML/JSON) and streaming data.
Connections & Real-World Relevance
Links to the last lesson on ETL/ELT pipelines: each data source needs a different way to get data (all at once vs. parts at a time vs. tracking changes).
Data engineers must balance speed, amount of data handled, and consistency when combining traditional databases with streams.
Marketing, finance, supply chain, and manufacturing teams actively mix internal company data with weather, economic, and social information to gain an advantage.
Key Tools & Technologies Cheat-Sheet
Databases: SQL Server, Oracle, MySQL, IBM DB2.
File types: CSV, TSV, XLS/XLSX, XML.
Spreadsheet programs: Excel, Google Sheets, Apple Numbers, LibreOffice.
Web/Cloud APIs: REST, GraphQL, ODATA, SOAP.
Web Scraping: Beautiful Soup, Scrapy, Selenium, Pandas.
Streaming: Apache Kafka, Spark Streaming, Apache Storm.
Feed readers: Feedly, Inoreader, custom Python feedparser.
Key Takeaways
No single data source is enough; modern data analysis systems must combine data from databases, files, APIs, scraped web pages, and streams.
It's fundamental to understand how data is set up (tables vs. nested), how it's sent (all at once vs. real-time), and any rules for using it.
Choosing the right tools should match the speed, size, variety, and truthfulness of the data—the classic “4 Vs.”