Extracting Information From Data

Data comprises raw, unprocessed facts or details without inherent meaningful context.
Information is processed, interpreted, structured, and presented data within a context to make it useful.
This sprint focuses on processing data to extract useful information.

Data can exist in various formats (numbers, text, images, audio, video) and often requires processing to become useful.
Information is used for making decisions, predictions, and identifying patterns.
Data science and business intelligence involve manipulating data to identify interesting patterns.

Data provides opportunities for identifying trends, making connections, and addressing problems.
Example: Analyzing student course enrollment data to identify trends.
Graphical representations convert numerical data into visual patterns, which makes it easier to understand trends and comparisons.
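As a rough illustration of how a chart exposes a trend, the sketch below totals enrollments per year and prints a text-only bar chart. The records, the function names, and the scale of one "#" per 10 enrollments are all made up for this example; a real analysis would more likely use a spreadsheet chart or a plotting library.

```python
from collections import defaultdict

# Hypothetical enrollment records: (year, course, number of enrollments).
records = [
    (2021, "Computer Science", 120),
    (2022, "Computer Science", 150),
    (2023, "Computer Science", 180),
    (2021, "Health Science", 90),
    (2022, "Health Science", 85),
    (2023, "Health Science", 80),
]


def totals_by_year(rows):
    """Add up enrollments for each year, returned in year order."""
    totals = defaultdict(int)
    for year, _course, count in rows:
        totals[year] += count
    return dict(sorted(totals.items()))


def text_bar_chart(totals, unit=10):
    """Draw one bar per label, using one '#' for every `unit` enrollments."""
    lines = []
    for label, value in totals.items():
        lines.append(f"{label} | {'#' * (value // unit)} {value}")
    return "\n".join(lines)


yearly = totals_by_year(records)
print(text_bar_chart(yearly))
```

In this sample the bars grow longer each year, so the upward trend in total enrollment is visible without reading the individual numbers.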

Gathering data is often the first step in strategic planning and business decision-making.

Set the project goal by formulating research questions that data analysis should answer.
Identify the type of data required for analysis.
Verify whether the collected data can answer the research questions.
Determine the appropriate data collection method.
Research other sources that could supply the required data.

The World Bank dataset (https://data.worldbank.org/indicator) includes statistical data from macro, financial, and sector databases.
UNdata (https://data.un.org/) is a web-based data service providing access to international statistical databases.

CSV (comma-separated values) is a commonly recommended format for storing text-based data.
CSV files are widely used due to their compatibility for data storage, exchange, and sharing, as well as their ease of handling.
Example: A student enrollment dataset in CSV format, where each value is separated by a comma.
The first row typically contains the column headers, and each following row holds one record whose values line up with those headers.
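A minimal sketch of that layout, read with Python's standard csv module. The column names and student records are invented for illustration, and the text is held in a string instead of a file so that the example is self-contained.

```python
import csv
import io

# Hypothetical student enrollment data in CSV form.
# The first line is the header row; every other line is one record.
csv_text = """student_id,name,course,year_of_enrollment
101,Asha,Computer Science,2022
102,Ben,Health Science,2023
103,Chen,Computer Science,2023
"""

# DictReader uses the header row as the keys for each record.
reader = csv.DictReader(io.StringIO(csv_text))
header = reader.fieldnames
rows = list(reader)

print(header)
for row in rows:
    print(row["student_id"], row["name"], row["course"])
```

Note that every value read from a CSV file arrives as text, so numbers such as the year need converting before any arithmetic.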

CSV files lack built-in tools for data retrieval or manipulation.
Spreadsheet applications can be used to work with CSV files.
Programming languages like Python and R are commonly used to manipulate large datasets.
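For instance, a few lines of Python can count and filter records that would be tedious to process by hand. This is only a sketch: the sample data and the helper names load_rows, count_by_course, and enrolled_in are assumptions made for this example.

```python
import csv
import io
from collections import Counter

# Hypothetical enrollment data; a real dataset would be read from a file.
CSV_TEXT = """student_id,name,course,year_of_enrollment
101,Asha,Computer Science,2022
102,Ben,Health Science,2023
103,Chen,Computer Science,2023
"""


def load_rows(text):
    """Parse CSV text into a list of dictionaries keyed by header name."""
    return list(csv.DictReader(io.StringIO(text)))


def count_by_course(rows):
    """Count how many students are enrolled in each course."""
    return Counter(row["course"] for row in rows)


def enrolled_in(rows, year):
    """Return the names of students who enrolled in the given year."""
    return [row["name"] for row in rows if int(row["year_of_enrollment"]) == year]


rows = load_rows(CSV_TEXT)
course_counts = count_by_course(rows)
names_2023 = enrolled_in(rows, 2023)
print(course_counts.most_common())
print(names_2023)
```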

Data is stored in tables within databases, and queries are used to extract data.
Each table has a primary key, a unique identifier for each record (like a student ID or customer number).
Related tables can be grouped together for analysis.
Databases are generally more reliable than standalone files, partly because they are easier to back up.
Data manipulation is also more straightforward in a database than in a CSV file.
A query is a request for data or information from a database, typically written in a query language like SQL (Structured Query Language).
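The ideas above (tables, primary keys, related tables, and queries) can be sketched with SQLite, which ships with Python as the sqlite3 module. The table names, column names, and records are hypothetical, and the database lives in memory only for the duration of the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary in-memory database

# Two related tables; each has a primary key that uniquely identifies a record.
conn.execute("CREATE TABLE courses (course_id INTEGER PRIMARY KEY, title TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE students ("
    "student_id INTEGER PRIMARY KEY, "
    "name TEXT NOT NULL, "
    "course_id INTEGER REFERENCES courses(course_id))"
)
conn.executemany(
    "INSERT INTO courses VALUES (?, ?)",
    [(1, "Computer Science"), (2, "Health Science")],
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [(101, "Asha", 1), (102, "Ben", 2), (103, "Chen", 1)],
)

# A query: ask for the names of students in one course by joining the tables.
query = (
    "SELECT s.name, c.title FROM students AS s "
    "JOIN courses AS c ON s.course_id = c.course_id "
    "WHERE c.title = ? ORDER BY s.name"
)
cs_students = conn.execute(query, ("Computer Science",)).fetchall()
print(cs_students)

# The primary key rejects a second record with an existing student_id.
try:
    conn.execute("INSERT INTO students VALUES (101, 'Duplicate', 2)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)
```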

Metadata is data about data.
Example: A picture taken on a mobile phone is stored with data such as the date, time, size, device information, camera specifications, and location details.
Metadata is found everywhere (e.g., labels on books, product tags, movie tickets).
It is available with images, videos, text files, or any stored file.
Changing a file's metadata does not alter the original data it describes.
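A small sketch of file metadata in practice, assuming an ordinary file system: it writes a temporary file, reads two metadata fields (size and last-modified time), then changes the modified time and confirms that the file's contents are untouched. The file name and the timestamp value are arbitrary choices for this example.

```python
import os
import tempfile
from datetime import datetime, timezone

NEW_TIMESTAMP = 1_600_000_000  # arbitrary time, in seconds since 1970-01-01 UTC

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "notes.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("enrollment notes")

    # Read metadata kept by the file system alongside the file's contents.
    info = os.stat(path)
    size_bytes = info.st_size
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    print(size_bytes, modified.isoformat())

    # Change only the metadata: set the access and modified times.
    os.utime(path, (NEW_TIMESTAMP, NEW_TIMESTAMP))
    new_mtime = os.stat(path).st_mtime

    # The data itself is unchanged.
    with open(path, encoding="utf-8") as f:
        content_after = f.read()

print(content_after)
```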

Metadata provides meaningful details about large datasets.
It simplifies searching, sorting, and grouping of data.
Metadata helps store data in a structured and organized way.
Data stored together with its metadata can be used more effectively and analyzed more thoroughly.
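To make the searching, sorting, and grouping benefits concrete, here is a sketch that works only with photo metadata and never opens the images themselves. The file names, field names, and values are invented for illustration.

```python
from collections import defaultdict

# Hypothetical metadata records for three photos.
photos = [
    {"file": "IMG_001.jpg", "taken": "2024-03-02", "device": "Phone A", "size_kb": 2048},
    {"file": "IMG_002.jpg", "taken": "2024-01-15", "device": "Phone B", "size_kb": 1024},
    {"file": "IMG_003.jpg", "taken": "2024-02-20", "device": "Phone A", "size_kb": 3072},
]


def search(items, field, value):
    """Find the files whose metadata field equals the given value."""
    return [item["file"] for item in items if item[field] == value]


def sort_by(items, field):
    """List files in ascending order of a metadata field."""
    return [item["file"] for item in sorted(items, key=lambda item: item[field])]


def group_by(items, field):
    """Collect files into groups that share the same metadata value."""
    groups = defaultdict(list)
    for item in items:
        groups[item[field]].append(item["file"])
    return dict(groups)


print(search(photos, "device", "Phone A"))
print(sort_by(photos, "taken"))  # dates in YYYY-MM-DD form sort correctly as text
print(group_by(photos, "device"))
```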

Choose a Topic from the following examples or propose your own:
- Healthy lifestyle habits among students
- Screen time and study performance
- Favorite social media platforms and usage patterns
- Opinions on school canteen food
- Environmental awareness and recycling behavior
Define the Purpose: Clearly state what you want to find out from your survey (e.g., “To understand how much time students spend on their phones after school”).
Plan the Survey Questions: Your form should include:
- 3–4 multiple-choice questions
- 1 rating scale or Likert scale
Design the Form using a digital tool like Google Forms, Microsoft Forms, or any survey creator.
Submit the Following:
- A link to the form
- A short paragraph explaining what kind of data you hope to collect and how this data could help solve or understand a problem
Bonus challenge (for extra marks): Create a visual representation that analyzes one key aspect of your survey results, such as the ratings or most selected responses.

Collected data may contain incorrect information due to data entry errors, software errors, or lack of uniformity.
Correcting data before analysis is crucial.
The process of making data suitable for analysis is called data cleaning.

Course name inconsistencies (e.g., "Computer Science" recorded as both "CS" and "Comp Sc.").
Inconsistent date formats in the "Year of enrollment" column.
Misspellings (e.g., "Health Science" incorrectly spelled).
Impossible values (e.g., "Number of enrollments" exceeding maximum permitted enrollments).
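Problems like the ones listed above can often be flagged automatically before any cleaning starts. The sketch below checks two of them: years that are not written as four digits, and enrollment numbers outside a permitted range. The sample rows, the limit of 200, and the function name find_problems are assumptions for this example, not values from any real dataset.

```python
import re

MAX_ENROLLMENTS = 200  # hypothetical maximum permitted enrollments per course

# Hypothetical rows; the second and third each contain one problem.
rows = [
    {"course": "Computer Science", "year": "2023", "enrollments": 180},
    {"course": "Health Science", "year": "23", "enrollments": 90},
    {"course": "Computer Science", "year": "2022", "enrollments": 950},
]


def find_problems(rows, max_enrollments=MAX_ENROLLMENTS):
    """Return (row index, description) pairs for values that look wrong."""
    problems = []
    for index, row in enumerate(rows):
        # Expect the year as exactly four digits, e.g. "2023".
        if not re.fullmatch(r"\d{4}", row["year"]):
            problems.append((index, "year is not in YYYY format"))
        # Enrollment counts cannot be negative or exceed the permitted maximum.
        if not 0 <= row["enrollments"] <= max_enrollments:
            problems.append((index, "enrollments outside permitted range"))
    return problems


problems = find_problems(rows)
for index, message in problems:
    print(f"row {index}: {message}")
```

Flagging a value does not say what the correct value is; that usually requires checking the original source.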

Lack of uniformity in data collection.
Input errors by data entry operators or users.
Abbreviations, misspellings, or varying capitalizations in open fields.
Data cleaning involves standardizing abbreviations, spellings, and capitalizations to a single form without altering the meaning.
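One simple way to standardize such values is a lookup table that maps every known variant to a single agreed form. This is a sketch under stated assumptions: the alias table is hand-built, the misspelling "helth science" is a made-up example, and the function name clean_course is hypothetical.

```python
# Map each known variant (in lower case) to one standard course name.
COURSE_ALIASES = {
    "cs": "Computer Science",
    "comp sc.": "Computer Science",
    "computer science": "Computer Science",
    "health science": "Health Science",
    "helth science": "Health Science",  # hypothetical misspelling
}


def clean_course(raw):
    """Return the standard course name, ignoring case and extra spaces.

    Unknown values are returned with surrounding spaces removed so that
    nothing is silently changed to the wrong course.
    """
    key = " ".join(raw.split()).lower()
    return COURSE_ALIASES.get(key, raw.strip())


raw_values = ["CS", "Comp Sc.", "  computer   science ", "HEALTH SCIENCE", "Physics"]
cleaned = [clean_course(value) for value in raw_values]
print(cleaned)
```

Leaving unrecognized values as they are keeps the meaning intact; they can then be reviewed by hand and added to the table if needed.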