Extracting Information From Data
Qubits: Future Ready CS Curriculum
Learning Objectives
- Understand the distinction between data and information.
- Understand the processes of collecting and storing data.
- Understand the concept of metadata.
- Know about cleaning data.
Outline
- Data comprises raw, unprocessed facts or details without inherent meaningful context.
- Information is processed, interpreted, structured, and presented data within a context to make it useful.
- This sprint focuses on processing data to extract useful information.
- Data comprises raw, unprocessed facts or details that lack inherent context.
- Data can exist in various formats (numbers, text, images, audio, video) and often requires processing to become useful.
- Information is data that has been processed, interpreted, structured, and presented in a context to be useful.
- Information is used for making decisions, predictions, and identifying patterns.
- Data science and business intelligence involve manipulating data to identify interesting patterns.
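The step from data to information can be sketched in a few lines of Python. This is a minimal illustration, not part of the original lesson: the list of course names is invented, and counting is only one of many possible processing steps.

```python
from collections import Counter

# Hypothetical raw data: one course name per enrollment form, in no particular order.
raw_enrollments = [
    "Computer Science", "Biology", "Computer Science",
    "Health Science", "Biology", "Computer Science",
]

# Processing step: count how many times each course appears.
counts = Counter(raw_enrollments)

# Information: the course with the most enrollments, stated in context.
most_popular, total = counts.most_common(1)[0]
summary = (
    f"{most_popular} is the most popular course "
    f"with {total} of {len(raw_enrollments)} enrollments."
)
print(summary)
```

On its own, the list is just data; the printed sentence is information because it answers a question someone might use to make a decision.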
Why is Data Important?
- Data provides opportunities for identifying trends, making connections, and addressing problems.
- Example: Analyzing student course enrollment data to identify trends.
- Graphical representations convert numerical data into visual patterns, which makes it easier to understand trends and comparisons.
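As a rough sketch of how a trend becomes visible, the snippet below draws a text-only bar chart and computes the change from one year to the next. The enrollment figures and the helper names `text_bar_chart` and `year_over_year_change` are assumptions made for illustration; a real project would more likely use a spreadsheet or a plotting library.

```python
# Hypothetical yearly enrollment totals for one course (invented for illustration).
enrollments_by_year = {2021: 40, 2022: 55, 2023: 70, 2024: 65}

def text_bar_chart(data, unit=5):
    """Return one line per year, drawing one '#' for every `unit` enrollments."""
    lines = []
    for year, count in sorted(data.items()):
        lines.append(f"{year} | {'#' * (count // unit)} {count}")
    return lines

def year_over_year_change(data):
    """Return the change in enrollments from each year to the next."""
    years = sorted(data)
    return {later: data[later] - data[earlier]
            for earlier, later in zip(years, years[1:])}

for line in text_bar_chart(enrollments_by_year):
    print(line)
print(year_over_year_change(enrollments_by_year))
```

The bars make the rise through 2023 and the dip in 2024 easier to spot than the raw numbers do.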
Collecting Data
- Gathering data is often the first step in strategic planning and business decision-making.
Key Points for Collecting Research-Oriented Data:
- Set the project goal by formulating research questions that data analysis should answer.
- Identify the type of data required for analysis.
- Verify whether the collected data can answer the research questions.
- Determine the appropriate data collection method.
- Look into other sources that could supply relevant data.
Open Datasets
- Data that is shared openly can be used by people anywhere in the world, provided it is of sufficient quantity and quality.
- Numerous open datasets are publicly available for research.
- It's crucial to check the licensing terms and attribution requirements before using these datasets.
Examples:
- The World Bank dataset (https://data.worldbank.org/indicator) includes statistical data from macro, financial, and sector databases.
- UNdata (https://data.un.org/) is a web-based data service providing access to international statistical databases.
Storing Data
Storing Data - CSV (Comma-Separated Values)
- CSV is a commonly recommended format for storing text-based data.
- CSV files are widely used because they are easy to handle and are compatible with most tools for storing, exchanging, and sharing data.
- Example: A student enrollment dataset in CSV format, where each value is separated by a comma.
- The first row typically contains the column headers; each following row holds one record, with its values in the same order as the headers.
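The layout described above can be read with Python's standard `csv` module. This is a minimal sketch: the student names and column names are made up, and the data is held in a string rather than a file so the example is self-contained.

```python
import csv
import io

# Hypothetical enrollment data in CSV form; the first row is the header.
csv_text = """student_id,name,course,year_of_enrollment
101,Asha,Computer Science,2023
102,Ben,Biology,2023
103,Chen,Computer Science,2024
"""

# io.StringIO lets the csv module read the string as if it were a file.
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

print(reader.fieldnames)   # column names taken from the header row
print(rows[0]["course"])   # each later row becomes a dict keyed by those names
```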
CSV Limitations:
- CSV files lack built-in tools for data retrieval or manipulation.
- Spreadsheet applications can be used to work with CSV files.
- Programming languages like Python and R are commonly used to manipulate large datasets.
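Because a CSV file is only text, any retrieval or calculation has to be written by hand or done in another tool. The sketch below shows one way this might look in Python; the data and the function name `total_enrollments_per_course` are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical enrollment counts per course and year.
csv_text = """course,year,enrollments
Computer Science,2023,120
Biology,2023,80
Computer Science,2024,150
Biology,2024,75
"""

def total_enrollments_per_course(text):
    """Sum the enrollments column for each course."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(text)):
        # Every CSV value is read as a string, so convert before adding.
        totals[row["course"]] += int(row["enrollments"])
    return dict(totals)

print(total_enrollments_per_course(csv_text))
```

Note that the numbers arrive as strings; the CSV format itself carries no information about data types.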
Storing Data - Databases
- Data is stored in tables within databases, and queries are used to extract data.
- Each table has a primary key, a unique identifier for each record (like a student ID or customer number).
- Related tables can be grouped together for analysis.
- Databases are generally more reliable than standalone files, in part because they are easy to back up.
- Manipulating data is more straightforward in a database than in CSV files.
- A query is a request for data or information from a database, typically written in a query language like SQL (Structured Query Language).
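A small example may help connect tables, primary keys, and queries. The sketch below uses SQLite through Python's built-in `sqlite3` module; the table name `students`, its columns, and the sample rows are assumptions chosen for illustration.

```python
import sqlite3

# An in-memory database keeps the example self-contained; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE students (
        student_id INTEGER PRIMARY KEY,  -- unique identifier for each record
        name       TEXT NOT NULL,
        course     TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO students (student_id, name, course) VALUES (?, ?, ?)",
    [(101, "Asha", "Computer Science"),
     (102, "Ben", "Biology"),
     (103, "Chen", "Computer Science")],
)

# A query: ask the database for the names of students in one course.
cs_students = [name for (name,) in conn.execute(
    "SELECT name FROM students WHERE course = ? ORDER BY name",
    ("Computer Science",),
)]
print(cs_students)

# The primary key rejects a second record with an existing student_id.
try:
    conn.execute("INSERT INTO students VALUES (101, 'Dee', 'Biology')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)
conn.close()
```

The `SELECT` statement replaces the hand-written loop that a CSV file would need, and the primary key prevents two records from sharing the same ID.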
Metadata
- Metadata is data about data.
- Example: A picture taken on a mobile phone is stored with data such as the date, time, size, device information, camera specifications, and location details.
- Metadata is found everywhere (e.g., labels on books, product tags, movie tickets).
- It is available with images, videos, text files, or any stored file.
- Changing the metadata does not alter the original data it describes.
- Metadata provides meaningful descriptive details about large datasets.
- It simplifies searching, sorting, and grouping of data.
- Metadata helps store data in a structured and organized way.
- Data can be effectively used, and better analysis can be performed when stored along with metadata.
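Every file on a computer carries some metadata, which can be inspected with Python's standard library. The sketch below creates a temporary file and reads its size and modification time; the file contents and the keys of the `metadata` dictionary are illustrative choices, not a standard format.

```python
import os
import tempfile
from datetime import datetime

# Create a small temporary file so the example has something to inspect.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("course,enrollments\nBiology,80\n")
    path = f.name

# os.stat returns details about the file, not the data inside it.
info = os.stat(path)
metadata = {
    "file_name": os.path.basename(path),
    "size_bytes": info.st_size,
    "last_modified": datetime.fromtimestamp(info.st_mtime).isoformat(),
}
print(metadata)

# Reading the metadata did not change the file's contents.
with open(path) as f:
    contents = f.read()

os.remove(path)  # clean up the temporary file
```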
Activity: Collecting Data – Group Work (20 Minutes)
- Choose a Topic from the following examples or propose your own:
  - Healthy lifestyle habits among students
  - Screen time and study performance
  - Favorite social media platforms and usage patterns
  - Opinions on school canteen food
  - Environmental awareness and recycling behavior
- Define the Purpose: Clearly state what you want to find out from your survey (e.g., “To understand how much time students spend on their phones after school”).
- Plan the Survey Questions: Your form should include:
  - 3–4 multiple-choice questions
  - 1 rating scale or Likert scale question
- Design the Form using a digital tool like Google Forms, Microsoft Forms, or any survey creator.
- Submit the Following:
  - A link to the form
  - A short paragraph explaining what kind of data you hope to collect and how this data could help solve or understand a problem
- Bonus challenge (extra marks): Create a visual representation of one key aspect of your survey results, such as the ratings or the most frequently selected responses.
Cleaning Data
- Collected data may contain incorrect information due to data entry errors, software errors, or lack of uniformity.
- Correcting data before analysis is crucial.
- The process of making data suitable for analysis is called cleaning data.
Example of Data Cleaning Issues:
- Course name inconsistencies (e.g., "Computer Science" recorded as both "CS" and "Comp Sc.").
- Inconsistent date formats in the "Year of enrollment" column.
- Misspellings (e.g., "Health Science" incorrectly spelled).
- Impossible values (e.g., "Number of enrollments" exceeding maximum permitted enrollments).
- Lack of uniformity in data collection.
- Input errors by data entry operators or users.
- Abbreviations, misspellings, or varying capitalizations in open fields.
- Data cleaning involves standardizing abbreviations, spellings, and capitalizations to a single form without altering the meaning.
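The standardizing step might look like the following in Python. This is a sketch under stated assumptions: the alias table, the misspelling "Helth Science", and the maximum of 200 enrollments are all invented for the example, and a real dataset would need its own rules.

```python
# Hypothetical mapping from known variants (lower-cased) to one standard name.
COURSE_ALIASES = {
    "cs": "Computer Science",
    "comp sc.": "Computer Science",
    "computer science": "Computer Science",
    "health science": "Health Science",
    "helth science": "Health Science",  # an assumed misspelling
}
MAX_ENROLLMENTS = 200  # assumed maximum permitted enrollments

def clean_course(name):
    """Trim whitespace, ignore capitalization, and map variants to one form."""
    key = name.strip().lower()
    # Unknown names are returned trimmed but otherwise unchanged for manual review.
    return COURSE_ALIASES.get(key, name.strip())

def is_valid_enrollment(count):
    """An enrollment count must lie between 0 and the permitted maximum."""
    return 0 <= count <= MAX_ENROLLMENTS

raw_rows = [("CS", 120), (" comp sc. ", 90), ("Helth Science", 60), ("Biology", 450)]
cleaned = [(clean_course(course), count, is_valid_enrollment(count))
           for course, count in raw_rows]
for row in cleaned:
    print(row)
```

Impossible values are flagged rather than silently changed, so a person can decide whether to correct or discard them.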
Challenges in Collecting and Processing Data
- Non-uniform data
- Invalid data
- Incomplete data
- Combined data from multiple sources
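The challenges listed above can be illustrated with a simple record check. The sketch below is one possible approach, not a complete validation tool: the field names, the two sample sources, and the function `check_record` are hypothetical.

```python
def check_record(record, required=("student_id", "course", "year")):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for field in required:
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")        # incomplete data
    year = record.get("year")
    if year not in (None, "") and not str(year).isdigit():
        problems.append("year is not a number")        # invalid data
    return problems

# Two hypothetical sources that name the same field differently (non-uniform data).
source_a = [{"student_id": "101", "course": "Biology", "year": "2023"}]
source_b = [{"id": "102", "course": "", "year": "twenty23"}]

# Combining sources: rename source_b's "id" field to match source_a.
combined = source_a + [
    {"student_id": r["id"], "course": r["course"], "year": r["year"]}
    for r in source_b
]

report = {r["student_id"]: check_record(r) for r in combined}
print(report)
```

Running a check like this before analysis shows which records need attention and why.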