Extracting Information From Data

Qubits: Future Ready CS Curriculum

Learning Objectives

  • Understand the distinction between data and information.
  • Understand the processes of collecting and storing data.
  • Understand the concept of metadata.
  • Know about cleaning data.

Outline

  • Data comprises raw, unprocessed facts or details without inherent meaningful context.
  • Information is processed, interpreted, structured, and presented data within a context to make it useful.
  • This sprint focuses on processing data to extract useful information.

What is Data and Information?

  • Data comprises raw, unprocessed facts or details that lack inherent context.
  • Data can exist in various formats (numbers, text, images, audio, video) and often requires processing to become useful.
  • Information is data that has been processed, interpreted, structured, and presented in a context to be useful.
  • Information is used for making decisions, predictions, and identifying patterns.
  • Data science and business intelligence involve manipulating data to identify interesting patterns.

Why is Data Important?

  • Data provides opportunities for identifying trends, making connections, and addressing problems.
  • Example: Analyzing student course enrollment data to identify trends.
  • Graphical representations convert numerical data into visual patterns, which makes it easier to understand trends and comparisons.
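As a rough illustration of the enrollment example above, the sketch below counts enrollments per year and prints them as a simple text bar chart. The records, course names, and function names are made up for illustration; they are not taken from a real dataset.

```python
from collections import Counter

# Hypothetical enrollment records as (year, course) pairs. Not real data.
records = [
    (2022, "Computer Science"),
    (2022, "Health Science"),
    (2023, "Computer Science"),
    (2023, "Computer Science"),
    (2023, "Health Science"),
    (2024, "Computer Science"),
    (2024, "Computer Science"),
    (2024, "Computer Science"),
]


def enrollments_per_year(rows, course):
    """Count how many records exist for `course` in each year."""
    return Counter(year for year, name in rows if name == course)


def text_bar_chart(counts):
    """Render counts as one line of '#' marks per year, sorted by year."""
    return "\n".join(f"{year}: {'#' * counts[year]}" for year in sorted(counts))


cs_counts = enrollments_per_year(records, "Computer Science")
print(text_bar_chart(cs_counts))
```

Even this crude chart makes the upward trend easier to spot than the raw list of records does.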

Collecting Data

  • Gathering data is often the first step in strategic planning and business decision-making.
Key Points for Collecting Research-Oriented Data:
  • Set the project goal by formulating research questions that data analysis should answer.
  • Identify the type of data required for analysis.
  • Verify whether the collected data can answer the research questions.
  • Determine the appropriate data collection method.
  • Research other sources that could provide the required data.
Open Datasets
  • Data of sufficient quantity and quality can be shared and accessed globally.
  • Numerous open datasets are publicly available for research.
  • Always check the licensing terms and attribution requirements before using these datasets.
Examples:
  • The World Bank dataset (https://data.worldbank.org/indicator) includes statistical data from macro, financial, and sector databases.
  • UNdata (https://data.un.org/) is a web-based data service providing access to international statistical databases.

Storing Data

Storing Data - CSV (Comma-Separated Values)
  • CSV is often the recommended format for storing text-based tabular data.
  • CSV files are widely used due to their compatibility for data storage, exchange, and sharing, as well as their ease of handling.
  • Example: A student enrollment dataset in CSV format, where each value is separated by a comma.
  • The first row typically contains the column headers; each following row contains one record, with a value for each header.
CSV Limitations:
  • CSV files lack built-in tools for data retrieval or manipulation.
  • Spreadsheet applications can be used to work with CSV files.
  • Programming languages like Python and R are commonly used to manipulate large datasets.
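A minimal sketch of working with CSV data in Python, using the standard `csv` module. The column names and student records are hypothetical, and the CSV text is held in a string here so the example runs on its own; a real file would be opened with `open(path, newline="")` instead.

```python
import csv
import io

# Hypothetical CSV content. The first row is the header row.
csv_text = """student_id,name,course,year
101,Asha,Computer Science,2023
102,Ben,Health Science,2023
103,Chen,Computer Science,2024
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_rows(csv_text)

# CSV has no built-in query tool, so filtering is done in code.
cs_students = [row["name"] for row in rows if row["course"] == "Computer Science"]
print(cs_students)
```

Note that every value read from a CSV file is a string, so numeric columns such as `year` need converting before any arithmetic.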
Storing Data - Databases
  • Data is stored in tables within databases, and queries are used to extract data.
  • Each table has a primary key, a unique identifier for each record (like a student ID or customer number).
  • Related tables can be grouped together for analysis.
  • Databases are more reliable, in part because they are easy to back up.
  • Data manipulation is more straightforward than with CSV files.
  • A query is a request for data or information from a database, typically written in a query language like SQL (Structured Query Language).
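The ideas above (tables, primary keys, queries) can be sketched with Python's built-in `sqlite3` module and an in-memory database. The `students` table, its columns, and its rows are assumptions made for illustration.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# student_id is the primary key: it uniquely identifies each record.
conn.execute(
    "CREATE TABLE students ("
    " student_id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " course TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO students (student_id, name, course) VALUES (?, ?, ?)",
    [
        (101, "Asha", "Computer Science"),
        (102, "Ben", "Health Science"),
        (103, "Chen", "Computer Science"),
    ],
)

# A query written in SQL: how many students are enrolled in each course?
per_course = conn.execute(
    "SELECT course, COUNT(*) FROM students GROUP BY course ORDER BY course"
).fetchall()
print(per_course)

# The primary key rejects a second record with an existing student_id.
try:
    conn.execute("INSERT INTO students VALUES (101, 'Dana', 'Health Science')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)

conn.close()
```

Compared with the CSV approach, the grouping and counting are expressed in a single query rather than in hand-written loops.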

Metadata

  • Metadata is data about data.
  • Example: A picture taken on a mobile phone is stored with data such as the date, time, size, device information, camera specifications, and location details.
  • Metadata is found everywhere (e.g., labels on books, product tags, movie tickets).
  • It is available with images, videos, text files, or any stored file.
  • Changing the metadata does not alter the original data it describes.
Use of Metadata
  • Metadata provides meaningful details about large datasets.
  • It simplifies searching, sorting, and grouping of data.
  • Metadata helps store data in a structured and organized way.
  • Data stored along with its metadata can be used more effectively and analyzed more thoroughly.
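As a small illustration of metadata attached to a stored file, the sketch below writes a temporary text file and reads back its name, size, and last-modified time using the standard library. The file name `notes.txt` and the helper `file_metadata` are hypothetical.

```python
import os
import tempfile
from datetime import datetime, timezone


def file_metadata(path):
    """Return a few metadata fields that the file system keeps for a file."""
    info = os.stat(path)
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return {
        "name": os.path.basename(path),
        "size_bytes": info.st_size,
        "modified": modified.isoformat(),
    }


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "notes.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("hello")  # the data itself: five characters
    meta = file_metadata(path)  # the data about that data

print(meta)
```

The file's content is only the word "hello"; the name, size, and timestamp describe the file without being part of that content.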

Activity: Collecting Data – Group Work (20 Minutes)

  1. Choose a Topic from the following examples or propose your own:
    • Healthy lifestyle habits among students
    • Screen time and study performance
    • Favorite social media platforms and usage patterns
    • Opinions on school canteen food
    • Environmental awareness and recycling behavior
  2. Define the Purpose: Clearly state what you want to find out from your survey (e.g., “To understand how much time students spend on their phones after school”).
  3. Plan the Survey Questions: Your form should include:
    • 3–4 multiple-choice questions
    • 1 rating-scale or Likert-scale question
  4. Design the Form using a digital tool like Google Forms, Microsoft Forms, or any survey creator.
  5. Submit the Following:
    • A link to the form
    • A short paragraph explaining what kind of data you hope to collect and how this data could help solve or understand a problem
  6. Bonus challenge (for extra marks): Create a visual representation that analyzes one key aspect of your survey results, such as the ratings or most selected responses.

Cleaning Data

  • Collected data may contain incorrect information due to data entry errors, software errors, or lack of uniformity.
  • Correcting data before analysis is crucial.
  • The process of making data suitable for analysis is called cleaning data.
Example of Data Cleaning Issues:
  • Course name inconsistencies (e.g., "Computer Science" recorded as both "CS" and "Comp Sc.").
  • Inconsistent date formats in the "Year of enrollment" column.
  • Misspellings (e.g., "Health Science" incorrectly spelled).
  • Impossible values (e.g., "Number of enrollments" exceeding maximum permitted enrollments).
Causes of Non-Uniform Data
  • Lack of uniformity in data collection.
  • Input errors by data entry operators or users.
  • Abbreviations, misspellings, or varying capitalizations in open fields.
  • Data cleaning involves standardizing abbreviations, spellings, and capitalizations to a single form without altering the meaning.
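One possible way to standardize course names is a lookup table that maps each known variant to a single canonical form. The mapping and the sample values below are assumptions for illustration; a real dataset would need its own list of variants.

```python
# Hypothetical lookup table: lowercase variant -> canonical course name.
CANONICAL = {
    "cs": "Computer Science",
    "comp sc.": "Computer Science",
    "computer science": "Computer Science",
    "health science": "Health Science",
    "helth science": "Health Science",  # a known misspelling
}


def clean_course(raw):
    """Normalize spacing and case, then map known variants to one form.

    Unknown values are returned with only the outer whitespace stripped,
    so nothing is silently changed to a different meaning.
    """
    key = " ".join(raw.split()).lower()
    return CANONICAL.get(key, raw.strip())


raw_values = ["CS", "Comp Sc.", " computer  science ", "Helth Science", "Art"]
cleaned = [clean_course(value) for value in raw_values]
print(cleaned)
```

Leaving unrecognized values such as "Art" untouched is a deliberate choice here: they can be reviewed by a person and added to the table later.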

Challenges in Collecting and Processing Data

  • Non-uniform data
  • Invalid data
  • Incomplete data
  • Combined data from multiple sources