Concepts and Technology of AI - Lecture 02: When Data Meets Statistics

Introduction to Data

  • Definition of Data: Data refers to factual information that can be used for reasoning, discussion, or calculation. The word can be understood from two perspectives:

    • Information (Singular): Data treated as a single body of knowledge.

    • Facts (Plural): Data viewed as individual points of information that together build a broader understanding. A single data point on its own is problematic: it lacks context, which makes it hard to draw reliable conclusions.

Key Definitions of Data

  • Factual Information: This includes basic measurements or statistics that serve as foundational elements for more complex calculations or analyses.

  • Digital Information: Information that is stored in a digital format, making it readily accessible for processing and analysis using computer systems.

  • Sensing Device Output: Data produced by sensing devices can contain both useful signal and irrelevant noise, so it needs a processing phase before it is clear and usable.

  • Course Definition: In this course, data is a collection of facts about objects or phenomena, made up of variables and measurements:

    • Variables: The attribute of an object or phenomenon being described, such as the sky's color or the height of Mt. Everest.

    • Measurements: The specific values recorded for those variables, which may be numeric or categorical, for example:

      • The sky's color: Blue

      • Height of Mt. Everest: 8848 meters

      • Temperature range: 23–33 °C
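
As a minimal sketch, the three examples above can be written as variable and measurement pairs. The dictionary layout and the "unit" field are assumptions made for illustration, not part of the course definition.

```python
# Each record pairs a variable (what is described) with a measurement (the recorded value).
# The field names "variable", "measurement", and "unit" are hypothetical.
records = [
    {"variable": "sky color", "measurement": "Blue", "unit": None},
    {"variable": "height of Mt. Everest", "measurement": 8848, "unit": "m"},
    {"variable": "temperature range", "measurement": (23, 33), "unit": "°C"},
]

for r in records:
    unit = f" {r['unit']}" if r["unit"] else ""
    print(f"{r['variable']}: {r['measurement']}{unit}")
```

Note that a measurement does not have to be a number: the first record holds a category, the second a single number, and the third a pair of numbers.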

Types of Data Measurements

  • Quantitative Data: This type of data is expressed as numbers, facilitating easier manipulation and statistical representation. It is further divided into two primary types:

    • Discrete Data: Countable values that correspond to isolated points on a number line, such as the number of students in a classroom.

    • Continuous Data: Values that can take on any value within a given range on a number line, like the weights of individuals in kilograms.

  • Qualitative Data: This refers to categorical data that cannot be easily quantified or numbered, but can be sorted into categories. This type is further categorized as:

    • Nominal Data: Data without a natural order, such as categories like colors or types of fruit.

    • Ordinal Data: Data that have a natural order or ranking, for example, clothing sizes labeled as S, M, L.
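
The four measurement types can be illustrated with a short Python sketch. All values below are invented for illustration, and the name SIZE_RANK is a hypothetical helper.

```python
from collections import Counter

# Hypothetical observations, invented for illustration.
students_per_class = [28, 31, 25]        # quantitative, discrete (counts)
weights_kg = [61.5, 72.3, 58.0]          # quantitative, continuous
fruit = ["apple", "banana", "apple"]     # qualitative, nominal (no order)
shirt_sizes = ["M", "S", "L", "M"]       # qualitative, ordinal (S < M < L)

# Nominal data supports counting by category, but not ranking.
fruit_counts = Counter(fruit)

# Ordinal data needs an explicit rank, because alphabetical order
# ("L" < "M" < "S") does not match the natural order of the sizes.
SIZE_RANK = {"S": 0, "M": 1, "L": 2}
sorted_sizes = sorted(shirt_sizes, key=SIZE_RANK.get)

print(fruit_counts["apple"])  # 2
print(sorted_sizes)           # ['S', 'M', 'M', 'L']
```

Averaging makes sense for the two quantitative lists, while the qualitative lists are better summarized by counts (nominal) or by rank-based statistics such as the median category (ordinal).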

Forms of Data

  • Raw Formats: Data can exist in various forms, and not all formats are inherently useful for analysis. Raw data refers to unprocessed collected facts which may require cleaning and organizing to derive insights.

Organizing Data

  • Need for Organization: Raw data lacks a structured form, making it difficult to analyze effectively. Thus, organization is essential for clarity and utility:

    • Structured Data: Organized data presented in a defined format, which is easily processed by computer systems. This type of data is often stored in databases.

    • Unstructured Data: Data with no predefined organization, such as free text, images, or audio. It must be given structure before it can be analyzed, which takes additional effort.
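
The difference can be sketched by turning free text into rows with named fields. The sentences, the field names, and the to_record helper are all invented for this example; real unstructured data usually needs far more robust parsing.

```python
import re

# Unstructured: free text with no fixed fields (invented example sentences).
notes = [
    "Alice weighs 61.5 kg and wears size M.",
    "Bob, size L, weighs 80 kg.",
]

def to_record(text):
    """Extract a structured row from one sentence (hypothetical helper)."""
    name = text.split()[0].strip(",")
    weight = re.search(r"(\d+(?:\.\d+)?)\s*kg", text)
    size = re.search(r"size\s+([SML])\b", text)
    return {
        "name": name,
        "weight_kg": float(weight.group(1)) if weight else None,
        "size": size.group(1) if size else None,
    }

# Structured: the same facts as rows with named fields.
table = [to_record(n) for n in notes]
for row in table:
    print(row)
```

Once the data is in the structured form, questions such as "what is the average weight?" become a one-line computation over a named field.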

Data Science Eco-system

  • Process Outline: The data science process typically involves several steps:

    1. Ask an Interesting Question: Formulating a research question to guide the data analysis.

    2. Sample and Collect the Data: Gathering data relevant to the research question.

    3. Data Wrangling: Cleaning and organizing data to prepare it for analysis.

    4. Explore the Data: Initial examination of data to identify trends or patterns.

    5. Data Analysis: Applying statistical methods to analyze the data.

    6. Statistical Learning: Employing algorithms for predictive modeling.

    7. Model Building: Creating models to understand data better or predict outcomes.

    8. Metric Evaluation: Assessing the performance of models based on predefined metrics.

    9. Result/Decision Presentation: Communicating findings effectively to stakeholders.
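
As an illustration of step 3 (Data Wrangling), the sketch below cleans a few raw rows. The rows and the wrangle function are hypothetical; which cleaning rules are appropriate depends on the dataset and the question being asked.

```python
# Hypothetical raw survey rows, invented for this sketch.
raw_rows = [
    {"name": " Alice ", "age": "23"},
    {"name": "Bob", "age": ""},        # missing age
    {"name": "alice", "age": "23"},    # duplicate once the name is normalized
    {"name": "Chen", "age": "31"},
]

def wrangle(rows):
    """Normalize names, drop rows with a missing age, and remove duplicates."""
    seen = set()
    clean = []
    for row in rows:
        name = row["name"].strip().title()
        if not row["age"].strip():
            continue                   # drop incomplete rows
        record = (name, int(row["age"]))
        if record in seen:
            continue                   # drop exact duplicates
        seen.add(record)
        clean.append({"name": name, "age": record[1]})
    return clean

clean_rows = wrangle(raw_rows)
print(clean_rows)
```

Dropping incomplete rows is only one possible choice; filling in missing values is another, and each choice can affect the later analysis steps.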

Data Collection Techniques

  • Searching for Relevant Data: Before collecting data, it is important to evaluate sources for quality and accuracy:

    • Confirm if the data is comprehensive and up-to-date.

  • Collecting Data: Common methods of data collection include:

    • Web Scraping: Gathering data from websites and online sources.

    • Surveys or Questionnaires: Conducting studies to gather primary data directly from individuals.

    • Secondary Sources: Utilizing existing databases or literature that provide relevant data.

  • Considerations in Data Collection: When collecting data, one must focus on ensuring dataset quality and authority:

    • Evaluate potential biases and understand the contexts under which data were gathered to ensure objectivity in analysis.

Comprehensive vs. Sampled Data

  • Comprehensive Data: This includes all relevant data points for an entire population, providing a complete view of the subject matter.

  • Sampled Data: Only a subset of the population is collected and analyzed. Conclusions about the whole population then depend on how the sample was drawn, because the sampling method may introduce biases that affect the results.
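
The sketch below compares a comprehensive mean with the mean of a random sample and of a deliberately biased sample. The population is simulated, and the sizes (10,000 and 200) are arbitrary assumptions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated heights in cm.
population = [random.gauss(170, 10) for _ in range(10_000)]

# Comprehensive: use every member of the population.
population_mean = statistics.mean(population)

# Sampled: a simple random sample of 200 members.
sample = random.sample(population, 200)
sample_mean = statistics.mean(sample)

# A biased "sample": only the tallest 200 members are included.
biased_sample = sorted(population)[-200:]
biased_mean = statistics.mean(biased_sample)

print(f"population: {population_mean:.1f}")
print(f"random sample: {sample_mean:.1f}")
print(f"biased sample: {biased_mean:.1f}")
```

The random sample should land close to the population mean, while the biased sample overestimates it by a wide margin, even though both contain the same number of data points.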

Biases in Data Sampling

  • Types of Bias: Recognizing various biases is crucial:

    • Selection Bias: Systematic exclusion of certain subjects leading to a non-representative sample.

    • Survivor Bias: Only considering subjects that have successfully completed trials or conditions, disregarding dropouts or failures.

    • Simpson's Paradox: A phenomenon where a trend that holds within each of several groups reverses or disappears when the groups are combined into one aggregate dataset.
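
Simpson's Paradox is easiest to see with numbers. The sketch below uses illustrative counts chosen for this example: two treatments, A and B, compared within a "small" and a "large" case group. The group names and the rate helper are assumptions.

```python
# (successes, trials) per treatment within each group; illustrative counts.
groups = {
    "small": {"A": (81, 87), "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, trials):
    return successes / trials

# Within each group, treatment A has the higher success rate.
for name, g in groups.items():
    print(name, f"A={rate(*g['A']):.3f}", f"B={rate(*g['B']):.3f}")

# Aggregated over both groups, treatment B appears better.
totals = {
    t: tuple(sum(g[t][i] for g in groups.values()) for i in (0, 1))
    for t in ("A", "B")
}
print("all", f"A={rate(*totals['A']):.3f}", f"B={rate(*totals['B']):.3f}")
```

The reversal happens because the two treatments were applied to the groups in very different proportions: A was used mostly on the harder "large" cases, which pulls its overall rate down.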

Measures of Central Tendency

  • Mean, Median, Mode: These statistics each summarize the center of a dataset in a different way, and each suits a different purpose:

    • Mean: The arithmetic average, sensitive to extreme values.

    • Median: The middle value in a dataset, less influenced by extremes.

    • Mode: The most frequently occurring value in the dataset.
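
A short sketch using Python's standard statistics module shows how one extreme value affects the three measures. The income figures are invented for illustration.

```python
import statistics

# Hypothetical incomes with one extreme value (invented).
incomes = [30, 32, 32, 35, 38, 40, 400]

mean = statistics.mean(incomes)      # pulled upward by the value 400
median = statistics.median(incomes)  # middle value of the sorted list
mode = statistics.mode(incomes)      # most frequent value

print(f"mean={mean:.2f} median={median} mode={mode}")
# mean=86.71 median=35 mode=32
```

Six of the seven values lie between 30 and 40, yet the mean is above 86. The median and mode stay close to where most of the data sits, which is why the median is often preferred for skewed data.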

Measures of Dispersion

  • Spread: Statistics such as the range, variance, and standard deviation describe how widely the data points are spread out:

    • Range: The difference between the maximum and minimum values.

    • Variance: The average of the squared differences from the mean, revealing the data's spread.

    • Standard Deviation: The square root of variance, providing a measure of how spread out the values are around the mean.
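
The three measures can be computed with the standard statistics module. The data values are invented. The definition above (dividing by the number of values n) corresponds to the population variance; the sample variance, which divides by n - 1, is shown for comparison as an addition to the notes.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values; their mean is 5

# Range: maximum minus minimum.
data_range = max(data) - min(data)

# Population variance: sum of squared differences from the mean, divided by n.
pop_variance = statistics.pvariance(data)

# Standard deviation: square root of the variance, in the same units as the data.
pop_std = statistics.pstdev(data)

# Sample variance divides by n - 1 instead of n.
sample_variance = statistics.variance(data)

print(data_range, pop_variance, pop_std)  # 7 4 2.0
print(round(sample_variance, 3))          # 4.571
```

Because the standard deviation is in the same units as the data, it is usually easier to interpret than the variance.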

Data Visualization and Interpretation

  • Graphical Methods: Techniques such as histograms, scatter plots, and box plots facilitate the visualization of data distributions, enabling quick insights into the shape, central tendencies, and dispersion of the data.
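
As a minimal sketch of what a histogram does, the code below groups values into bins and prints a text bar per bin. The scores and the bin width of 10 are assumptions; a plotting library would normally be used to draw the chart.

```python
from collections import Counter

# Hypothetical exam scores (invented).
scores = [52, 55, 61, 63, 64, 67, 68, 70, 71, 71, 74, 78, 83, 85, 96]

BIN_WIDTH = 10
# Map each score to the lower edge of its bin, e.g. 67 -> 60.
bins = Counter((s // BIN_WIDTH) * BIN_WIDTH for s in scores)

for lower in sorted(bins):
    upper = lower + BIN_WIDTH - 1
    print(f"{lower:3d}-{upper:3d} | {'#' * bins[lower]}")
```

Even in this text form, the shape is visible: most scores fall in the 60s and 70s, with fewer at either end.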

Bivariate Analysis

  • Correlation vs. Causation: When analyzing the relationship between two variables, a correlation shows only that they tend to change together. It does not show that one causes the other; assuming so is the common fallacy that correlation implies causation.
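
The sketch below computes the Pearson correlation coefficient, one common measure of linear correlation, for two invented series. The pearson_r function and all figures are assumptions made for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical monthly figures (invented).
ice_cream_sales = [20, 26, 33, 41, 48, 55]
sunburn_cases = [2, 3, 5, 6, 8, 9]

r = pearson_r(ice_cream_sales, sunburn_cases)
print(f"r = {r:.3f}")
```

The coefficient is close to 1, yet ice cream does not cause sunburn. In a scenario like this, a third variable such as hot, sunny weather would plausibly drive both series, which is exactly the situation the warning about causation is meant to cover.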
