Data requirements
There can be various sources of data for an organization. It is important to comprehend what type of data the organization needs to collect, curate, and store. For example, an application tracking the sleeping patterns of patients suffering from dementia requires storing several types of sensor data, such as sleep data, the patient's heart rate, electrodermal activity, and user activity patterns. All of these data points are required to correctly diagnose the mental state of the person. Hence, these are mandatory requirements for the application. In addition, the data must be categorized as numerical or categorical, and the format of storage and dissemination must be decided.
Data collection
Data collected from several sources must be stored in the correct format and transferred to the right information technology personnel within a company. As mentioned previously, data can be collected from several objects on several events using different types of sensors and storage tools.
Data processing
Preprocessing involves pre-curating the dataset before actual analysis. Common tasks involve correctly exporting the dataset, placing it under the right tables, structuring it, and exporting it in the correct format.
Data cleaning
Preprocessed data is still not ready for detailed analysis. It must first be checked for incompleteness, duplicates, errors, and missing values. These tasks are performed in the data cleaning stage, which involves responsibilities such as matching the correct record, finding inaccuracies in the dataset, understanding the overall data quality, removing duplicate items, and filling in the missing values. However, how can we identify these anomalies in a dataset? Finding such data issues requires some analytical techniques. In brief, data cleaning depends on the types of data under study. Hence, it is essential for data scientists or EDA experts to comprehend different types of datasets. An example of data cleaning would be using outlier detection methods for quantitative data cleaning.
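The cleaning tasks named above (removing duplicates, filling in missing values, and detecting outliers in quantitative data) can be sketched as follows. This is a minimal illustration using only the Python standard library; the `clean_readings` helper, the heart-rate records, and the choice of median filling and the 1.5 × IQR rule are assumptions made for the example, not methods prescribed by the card.

```python
import statistics

def clean_readings(readings):
    """Drop duplicate records, fill missing values, and flag outliers.

    `readings` is a list of (patient_id, heart_rate) tuples (hypothetical data);
    a heart_rate of None marks a missing value.
    """
    # Remove exact duplicate records while keeping first-seen order.
    unique = list(dict.fromkeys(readings))

    # Fill missing heart rates with the median of the observed values
    # (one common choice; other strategies are equally valid).
    observed = [hr for _, hr in unique if hr is not None]
    median = statistics.median(observed)
    filled = [(pid, hr if hr is not None else median) for pid, hr in unique]

    # Flag outliers with the 1.5 * IQR rule (an assumed detection method).
    q1, _, q3 = statistics.quantiles([hr for _, hr in filled], n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [(pid, hr) for pid, hr in filled if hr < low or hr > high]
    return filled, outliers

readings = [
    ("p1", 62), ("p2", 64), ("p2", 64),  # duplicate record for p2
    ("p3", None),                        # missing value
    ("p4", 66), ("p5", 68), ("p6", 70),
    ("p7", 190),                         # suspiciously high reading
]
filled, outliers = clean_readings(readings)
print(filled)
print(outliers)
```

Whether a flagged value is an error or a genuine extreme reading still has to be decided with domain knowledge; the rule only points at candidates.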
EDA
as mentioned before, is the stage where we actually start to understand the message contained in the data. It should be noted that several types of data transformation techniques might be required during the process of exploration.
Modeling and algorithm
From a data science perspective, generalized models or mathematical formulas can represent or exhibit relationships among different variables, such as correlation or causation. These models or equations involve one or more variables that depend on other variables to cause an event.
Data Product
Any computer software that uses data as inputs, produces outputs, and provides feedback based on the output to control the environment is referred to as a ———. —————— is generally based on a model developed during data analysis, for example, a recommendation model that inputs user purchase history and recommends a related item that the user is highly likely to buy.
Communication
This stage deals with disseminating the results to end stakeholders to use the result for business intelligence. One of the most notable steps in this stage is data visualization. Visualization deals with information relay techniques such as tables, charts, summary diagrams, and bar charts to show the analyzed result.
4 steps
How many steps are there in EDA?
8 stages
How many stages are there in EDA?
Problem definition
Before trying to extract useful insight from the data, it is essential to define the business problem to be solved. The ——— works as the driving force for a data analysis plan execution. The main tasks involved in ——— are defining the main objective of the analysis, defining the main deliverables, outlining the main roles and responsibilities, obtaining the current status of the data, defining the timetable, and performing cost/benefit analysis. Based on such a ————, an execution plan can be created.
Data preparation
This step involves methods for preparing the dataset before actual analysis. In this step, we define the sources of data, define data schemas and tables, understand the main characteristics of the data, clean the dataset, delete non-relevant datasets, transform the data, and divide the data into required chunks for analysis.
Data analysis
This is one of the most crucial steps that deals with descriptive statistics and analysis of the data. The main tasks involve summarizing the data, finding the hidden correlation and relationships among the data, developing predictive models, evaluating the models, and calculating the accuracies. Some of the techniques used for data summarization are summary tables, graphs, descriptive statistics, inferential statistics, correlation statistics, searching, grouping, and mathematical models.
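As a small illustration of the summarization and correlation tasks described above, the sketch below computes descriptive statistics and a Pearson correlation coefficient with the standard library. The `pearson` helper and the sleep and heart-rate numbers are hypothetical, chosen only to show the calculation.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical observations for five patients.
sleep_hours = [5, 6, 7, 8, 9]
resting_hr = [80, 76, 72, 70, 62]

# Descriptive statistics summarize each variable on its own.
summary = {
    "mean_sleep": statistics.mean(sleep_hours),
    "median_hr": statistics.median(resting_hr),
    "stdev_hr": statistics.stdev(resting_hr),
}

# Correlation describes how the two variables move together.
r = pearson(sleep_hours, resting_hr)
print(summary)
print(round(r, 3))
```

A strongly negative value of `r` here would only indicate association in this made-up sample; it says nothing about causation.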
Development and representation of the results
This step involves presenting the dataset to the target audience in the form of graphs, summary tables, maps, and diagrams. This is also an essential step, as the results of the analysis should be interpretable by the business stakeholders, which is one of the major goals of EDA. Common graphical analysis techniques include scatter plots, character plots, histograms, box plots, residual plots, mean plots, and others.
Descriptive and Inferential
What are the areas of statistics?
Population
entire group of units that is the focus of the study
Sample
small but representative fraction, portion, or subset of the population from which the information is collected
Parameter
measured characteristic of a population
Statistic
measured characteristic of a sample
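The difference between a parameter and a statistic can be illustrated with a small sketch: the same measure (here the mean) is a parameter when computed on the whole population and a statistic when computed on a sample. The population of 100 numbered units and the sample size of 10 are assumptions made for the example.

```python
import random
import statistics

# Hypothetical population: one measured value per unit, for 100 units.
population = list(range(1, 101))

# Parameter: a characteristic measured on the entire population.
parameter = statistics.mean(population)

# Sample: a small subset of the population, drawn at random.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = rng.sample(population, 10)

# Statistic: the same characteristic measured on the sample only.
statistic = statistics.mean(sample)

print(parameter, statistic)
```

The statistic usually differs from the parameter and changes from sample to sample, which is why it is treated as an estimate.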
Variable
characteristic observed or measured on every unit of the population
Observation
realized value of the variable
Data (or Data Set)
collection of all observations
Unit
Individual or object on which a variable is measured
Discrete and Continuous
What are the fields under Quantitative variables?
Numerical Data
This data has a sense of measurement involved in it; for example, a person's age, height, weight, blood pressure, heart rate, temperature, number of teeth, number of bones, and the number of family members. This data is often referred to as ----------------- in statistics. The ------------------- can be either discrete or continuous types.
Discrete data
This is data that is countable and its values can be listed out. For example, if we flip a coin 200 times, the number of heads can take any value from 0 to 200, which is a finite number of cases. A variable that represents a -------------- is referred to as a ---------- variable. The ----------- variable takes a fixed number of distinct values. For example, the Country variable can have values such as Nepal, India, Norway, and Japan. It is fixed. The Rank variable of a student in a classroom can take values from 1, 2, 3, 4, 5, and so on.
Continuous Data
A variable that can have an infinite number of numerical values within a specific range is classified as continuous data. A variable describing ----------- data is a ------------- variable. For example, what is the temperature of your city today? Can its possible values be listed as a finite set? Similarly, the weight variable mentioned under numerical data is a ----------- variable.
Categorical Data
This type of data represents the characteristics of an object; for example, gender, marital status, type of address, or categories of the movies. This data is often referred to as ----------- datasets in statistics. To understand clearly, here are some of the most common types of --------- you can find in data:
Binary categorical variable
can take exactly two values and is also referred to as a dichotomous variable. For example, when you create an experiment, the result is either success or failure. Hence, results can be understood as a -------------------.
Polytomous variables
are categorical variables that can take more than two possible values. For example, marital status can have several values, such as annulled, divorced, interlocutory, legally separated, married, polygamous, never married, domestic partner, unmarried, widowed, and unknown. Since marital status can take more than two possible values, it is a ----------------.
Nominal
These are used for labeling variables without any quantitative value. The scales are generally referred to as labels. They are mutually exclusive and do not carry any numerical importance.
Frequency
is the rate at which a label occurs over a period of time within the dataset.
Proportion
can be calculated by dividing the frequency by the total number of events.
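Frequency and proportion for a nominal variable can be computed as in the sketch below. The list of country labels is hypothetical data invented for the example.

```python
from collections import Counter

# Hypothetical nominal observations (labels with no numerical meaning).
labels = ["Nepal", "India", "Nepal", "Norway", "Japan", "Nepal", "India", "Japan"]

# Frequency: how many times each label occurs in the dataset.
frequency = Counter(labels)

# Proportion: frequency divided by the total number of events.
total = sum(frequency.values())
proportion = {label: count / total for label, count in frequency.items()}

print(frequency)
print(proportion)
```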
Ordinal
In ----------- scales, the order of the values is a significant factor. An easy tip to remember the -------- scale is that it sounds like an order.
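Because order is what matters on this scale, values can be sorted and a middle value can be found, even though the gaps between levels are not measurable. A minimal sketch, assuming a hypothetical three-level rating (`low`, `medium`, `high`) that does not come from the card itself:

```python
# Hypothetical ordered levels, from lowest to highest.
ORDER = ["low", "medium", "high"]
rank = {level: position for position, level in enumerate(ORDER)}

responses = ["high", "low", "medium", "low", "high"]

# Sorting uses the defined order, not alphabetical order.
sorted_responses = sorted(responses, key=rank.__getitem__)

# The median is meaningful for ordered data (odd number of responses here).
median_level = sorted_responses[len(sorted_responses) // 2]

print(sorted_responses)
print(median_level)
```

The integer positions in `rank` encode order only; averaging them would wrongly assume equal spacing between levels.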
Interval
The ------- scale is the 3rd level of measurement scale. It is defined as a quantitative measurement scale in which the difference between two values is meaningful. In other words, the values are measured in an exact manner rather than a relative one, but the position of zero on the scale is arbitrary.
Ratio
The ----- scale is a quantitative variable measurement scale. It allows a researcher to compare intervals or differences between values, and also their ratios. The ----- scale is the 4th level of measurement and possesses a true zero point or character of origin, which is the unique feature of this scale. For example, a weight of 0 kg means no weight at all. By contrast, a temperature of 0 degrees Celsius does not mean there is no temperature; it is just a value on the scale, which is why Celsius temperature is an interval measurement rather than a ratio one.
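A short sketch of why a true zero point matters: differences survive a change of temperature unit, but ratios only make sense when zero means "none". The temperature and weight values are made up for the illustration.

```python
import math

def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to Kelvin, which has a true zero."""
    return celsius + 273.15

warm_c, cool_c = 20.0, 10.0
warm_k, cool_k = celsius_to_kelvin(warm_c), celsius_to_kelvin(cool_c)

# Differences agree on both scales (up to floating-point rounding).
diff_c = warm_c - cool_c
diff_k = warm_k - cool_k

# Ratios do not agree: "20 C is twice 10 C" is an artifact of where
# zero sits on the Celsius scale.
ratio_c = warm_c / cool_c
ratio_k = warm_k / cool_k

# Weight has a true zero, so its ratio is the same in any unit.
ratio_kg = 80.0 / 40.0
ratio_g = 80_000.0 / 40_000.0

print(diff_c, round(diff_k, 2))
print(ratio_c, round(ratio_k, 3))
print(ratio_kg, ratio_g)
```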