CS 103 - Data Science Final Exam

36 Terms


Data requirements

An organization can draw data from many sources, so it is important to understand what type of data needs to be collected, curated, and stored. For example, an application that tracks the sleeping patterns of patients with dementia needs to store data from several types of sensors, such as sleep data, the patient's heart rate, electro-dermal activity, and user activity patterns. All of these data points are required to correctly diagnose the mental state of the person, so they are mandatory requirements for the application. It is also necessary to categorize the data as numerical or categorical and to decide the format for storage and dissemination.


Data collection

Data collected from several sources must be stored in the correct format and transferred to the right information technology personnel within a company. Data can be collected from several objects during several events using different types of sensors and storage tools.


Data processing

Preprocessing means curating the dataset before the actual analysis. Common tasks include correctly exporting the dataset, placing it under the right tables, structuring it, and exporting it in the correct format.


Data cleaning

Preprocessed data is still not ready for detailed analysis. It must first be checked for incompleteness, duplicates, errors, and missing values. These tasks are performed in the data cleaning stage, which involves matching the correct records, finding inaccuracies in the dataset, understanding the overall data quality, removing duplicate items, and filling in missing values. How can these anomalies be identified in a dataset? Finding such issues requires analytical techniques, and the right technique depends on the type of data under study. Hence, it is essential for data scientists or EDA experts to understand the different types of datasets. One example of data cleaning is using outlier detection methods to clean quantitative data.
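
The cleaning tasks named above (removing duplicates, filling in missing values, detecting outliers) can be sketched with the Python standard library. This is only an illustration: the records, the field meanings, the median fill, and the 1.5 x IQR outlier rule are assumptions chosen for the example, not something the card prescribes.

```python
from statistics import median, quantiles

# Hypothetical raw records: (patient, heart_rate); None marks a missing value.
raw = [
    ("p1", 62), ("p2", 70), ("p2", 70),   # second p2 row is an exact duplicate
    ("p3", None),                          # missing value
    ("p4", 66), ("p5", 68), ("p6", 190),   # 190 is a suspicious reading
]

def drop_duplicates(rows):
    """Keep the first occurrence of each record, preserving order."""
    seen, out = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out

def fill_missing(rows):
    """Replace missing heart rates with the median of the observed ones."""
    fill = median(v for _, v in rows if v is not None)
    return [(k, fill if v is None else v) for k, v in rows]

def iqr_outliers(values):
    """Return values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]

deduped = drop_duplicates(raw)
cleaned = fill_missing(deduped)
outliers = iqr_outliers([v for _, v in cleaned])
```

Whether a flagged value is an error or a genuine extreme reading is a judgment that depends on the data under study.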


EDA

This is the stage where we actually start to understand the message contained in the data. Several types of data transformation techniques might be required during the process of exploration.


Modeling and algorithm

From a data science perspective, generalized models or mathematical formulas can represent or exhibit relationships among different variables, such as correlation or causation. These models or equations involve one or more variables that depend on other variables to cause an event.
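
One simple instance of such a model is a straight line relating a dependent variable to an independent one. The sketch below fits it by ordinary least squares; the function name `fit_line` and the study-hours data are hypothetical and constructed only for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: hours studied (independent) and exam score (dependent).
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]   # constructed to follow score = 2 * hours + 50

slope, intercept = fit_line(hours, scores)
predicted = slope * 6 + intercept   # the model's estimate for 6 hours
```

A fitted line like this describes an association between the variables; on its own it does not establish causation.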


Data Product

Any computer software that uses data as inputs, produces outputs, and provides feedback based on the output to control the environment is referred to as a ———. —————— is generally based on a model developed during data analysis, for example, a recommendation model that inputs user purchase history and recommends a related item that the user is highly likely to buy.


Communication

This stage deals with disseminating the results to end stakeholders to use the result for business intelligence. One of the most notable steps in this stage is data visualization. Visualization deals with information relay techniques such as tables, charts, summary diagrams, and bar charts to show the analyzed result.


4 steps

How many steps are there in EDA?


8 stages

How many stages are there in EDA?


Problem definition

Before trying to extract useful insight from the data, it is essential to define the business problem to be solved. The ——— works as the driving force for a data analysis plan execution. The main tasks involved in ——— are defining the main objective of the analysis, defining the main deliverables, outlining the main roles and responsibilities, obtaining the current status of the data, defining the timetable, and performing cost/benefit analysis. Based on such a ————, an execution plan can be created.


Data preparation

This step involves methods for preparing the dataset before actual analysis. In this step, we define the sources of data, define data schemas and tables, understand the main characteristics of the data, clean the dataset, delete non-relevant datasets, transform the data, and divide the data into required chunks for analysis.
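
The last task, dividing the data into chunks, might look like the following sketch. The card does not say what the chunks are for, so the two-way split, the 25% fraction, the fixed seed, and the name `split_rows` are all assumptions.

```python
import random

def split_rows(rows, holdout_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into two chunks."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

rows = list(range(20))   # stand-in for 20 prepared records
analysis_chunk, holdout_chunk = split_rows(rows)
```

Shuffling before splitting avoids chunks that simply reflect the order in which the records were stored.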


Data analysis

This is one of the most crucial steps that deals with descriptive statistics and analysis of the data. The main tasks involve summarizing the data, finding the hidden correlation and relationships among the data, developing predictive models, evaluating the models, and calculating the accuracies. Some of the techniques used for data summarization are summary tables, graphs, descriptive statistics, inferential statistics, correlation statistics, searching, grouping, and mathematical models.
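
Two of the summarization techniques listed above, descriptive statistics and correlation statistics, can be sketched as follows. The height and weight values are made up, and `pearson` is a hand-written helper rather than a library call.

```python
from statistics import mean, median, stdev

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical measurements for five people.
height_cm = [150, 160, 170, 180, 190]
weight_kg = [50, 58, 66, 74, 82]   # exactly linear in height by construction

summary = {
    "mean": mean(height_cm),
    "median": median(height_cm),
    "stdev": stdev(height_cm),     # sample standard deviation
}
r = pearson(height_cm, weight_kg)  # close to 1 for a perfect linear relation
```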


Development and representation of the results

This step involves presenting the dataset to the target audience in the form of graphs, summary tables, maps, and diagrams. It is an essential step because the results of the analysis should be interpretable by the business stakeholders, which is one of the major goals of EDA. Common graphical analysis techniques include scatter plots, character plots, histograms, box plots, residual plots, mean plots, and others.


Descriptive and Inferential

What are the areas of statistics?


Population

the entire group of units that is the focus of the study


Sample

small but representative fraction, portion, or subset of the population from which the information is collected


Parameter

measured characteristic of a population


Statistic

measured characteristic of a sample
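
The difference between a parameter and a statistic can be seen by computing the same measure on a population and on a sample drawn from it. The population of ages below is randomly generated and purely hypothetical.

```python
import random
from statistics import mean

# Hypothetical population: ages of all 1,000 units in the study.
rng = random.Random(42)
population = [rng.randint(18, 90) for _ in range(1000)]

parameter = mean(population)          # measured on the whole population

sample = rng.sample(population, 50)   # a small subset of the population
statistic = mean(sample)              # measured on the sample only
```

The statistic is usually close to the parameter but not equal to it, and a different sample would give a different statistic.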


Variable

characteristic observed or measured on every unit of the population


Observation

realized value of the variable


Data (or Data Set)

collection of all observations


Unit

Individual or object on which a variable is measured


Discrete and Continuous

What are the types of quantitative variables?


Numerical Data

This data has a sense of measurement involved in it; for example, a person's age, height, weight, blood pressure, heart rate, temperature, number of teeth, number of bones, and number of family members. This data is often referred to as ----------------- in statistics. The ------------------- can be either discrete or continuous.


Discrete data

This is data that is countable and whose values can be listed out. For example, the number of heads in 200 coin flips can take any value from the finite set 0 to 200. A variable that represents a -------------- is referred to as a ---------- variable. The ----------- variable takes a fixed number of distinct values. For example, the Country variable can have values such as Nepal, India, Norway, and Japan; the set of values is fixed. The Rank variable of a student in a classroom can take values such as 1, 2, 3, 4, 5, and so on.


Continuous Data

A variable that can take an infinite number of numerical values within a specific range is classified as continuous data. A variable describing ----------- data is a ------------- variable. For example, what is the temperature of your city today? Can its possible values be listed as a finite set? They cannot. Similarly, the weight variable in the previous slide is a ----------- variable.


Categorical Data

This type of data represents the characteristics of an object; for example, gender, marital status, type of address, or categories of the movies. This data is often referred to as ----------- datasets in statistics. To understand clearly, here are some of the most common types of --------- you can find in data:


Binary categorical variable

can take exactly two values and is also referred to as a dichotomous variable. For example, when you run an experiment, the result is either success or failure. Hence, the result can be understood as a -------------------.


Polytomous variables

are categorical variables that can take more than two possible values. For example, marital status can have several values, such as annulled, divorced, interlocutory, legally separated, married, polygamous, never married, domestic partner, unmarried, widowed, and unknown. Since marital status can take more than two possible values, it is a ----------------.


Nominal

These scales are used for labeling variables without any quantitative value. The values on such a scale are generally referred to as labels. They are mutually exclusive and do not carry any numerical significance.


Frequency

is the rate at which a label occurs over a period of time within the dataset.


Proportion

can be calculated by dividing the frequency by the total number of events.
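
Frequency and proportion can be computed together for a set of labels. In this sketch the list of languages is hypothetical, and frequency is taken as a plain count of how often each label occurs in the dataset.

```python
from collections import Counter

# Hypothetical nominal observations.
languages = ["Nepali", "English", "Hindi", "English", "Nepali", "English"]

frequency = Counter(languages)       # how many times each label occurs
total = sum(frequency.values())      # total number of events
proportion = {label: count / total for label, count in frequency.items()}
```

The proportions of all labels always sum to 1, which makes them easier to compare across datasets of different sizes than raw frequencies.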


Ordinal

In ----------- scales, the order of the values is a significant factor. An easy tip to remember the -------- scale is that it sounds like an order.
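
Because only the order matters on this scale, the labels can be ranked and sorted even though the distances between them are undefined. The agreement levels and responses below are hypothetical.

```python
# Hypothetical scale: the labels have a meaningful order,
# but the gaps between them are not assumed to be equal.
LEVELS = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
RANK = {label: position for position, label in enumerate(LEVELS)}

responses = ["agree", "neutral", "strongly agree", "disagree", "agree"]

# Sort by rank on the scale rather than alphabetically.
ordered = sorted(responses, key=RANK.__getitem__)

# The median only needs the order, so it is meaningful for this kind of data.
middle = ordered[len(ordered) // 2]
```

Averaging the rank numbers, by contrast, would silently assume that the steps between labels are equally spaced.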


Interval

The ------- scale is the 3rd level of measurement. It is defined as a quantitative measurement scale in which the difference between two values is meaningful. In other words, the values are measured exactly rather than relatively, and the zero point is arbitrary.


Ratio

The ----- scale is a quantitative measurement scale that allows a researcher to compare intervals or differences. The ----- scale is the 4th level of measurement and possesses a true zero point, or origin, which is the unique feature of this scale. By contrast, a temperature of 0 degrees Celsius is not a true zero: it does not mean there is no heat or cold, it is simply a value on the scale, which is why Celsius temperature is not measured on a ----- scale.
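
The role of the zero point can be illustrated with temperature. The sketch below assumes Kelvin as a counterpart to Celsius whose zero is a true zero; the two readings are hypothetical.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to Kelvin, whose zero is absolute zero."""
    return celsius + 273.15

warm_c, cool_c = 20.0, 10.0

# Differences are meaningful on both scales and agree with each other.
diff_c = warm_c - cool_c
diff_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

# Ratios are only meaningful on the scale with a true zero (Kelvin).
naive_ratio = warm_c / cool_c   # suggests "twice as hot", which is misleading
true_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)
```

The two differences match, but the ratios do not: on the Kelvin scale the warmer reading is only a few percent higher, not double.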