
Data Analytics Fundamentals - Comprehensive Notes

Week 3: Types of Data Analytics

Introduction to Analytics

  • Analytics has been used since the late 19th century to improve efficiency.

  • Increased use in the 1960s when computers became central to organizational decision support systems.

  • Traditionally, business managers used qualitative decision-making based on past experiences or rules of thumb.

  • The proliferation of big data has forced industries to develop particular types of data analytics.

  • Implementing the DIKW (Data, Information, Knowledge, Wisdom) approach helps make sense of complex data and reach the point of wisdom.

Types of Data Analytics

  • Four main types:

    • Descriptive

    • Diagnostic

    • Predictive

    • Prescriptive

  • These types vary in complexity and value.

  • Descriptive answers "What happened?" (Past).

  • Diagnostic answers "Why did it happen?" (Past).

  • Predictive answers "What will happen?" (Future).

  • Prescriptive answers "How can we make it happen?" (Future).

Descriptive Analytics

  • Describes or summarizes raw data into a human-interpretable form.

  • Examines past behaviors to identify patterns or trends and understand how these might influence future outcomes.

  • Uses statistics such as count, min, max, sum, average, percentage, percent change, etc.

Measures of Central Tendency
  • Central tendency: A measure of the central value in a dataset.

    • Mean: Average of values in a dataset.

      • Example: Mean = (90 + 94 + 53 + 68 + 79 + 94 + 53 + 65 + 87 + 90 + 70 + 69 + 65 + 89 + 85 + 53 + 47 + 61 + 27 + 80)/20 = 70.95

    • Median: Middle number when values are arranged size-wise.

      • Example: 27, 47, 53, 53, 53, 61, 65, 65, 68, 69, 70, 79, 80, 85, 87, 89, 90, 90, 94, 94. Median = (69 + 70)/2 = 69.5

    • Mode: Most frequently occurring value in a dataset.

      • Example: Mode = 53

  • Percentile: Percentage of values below a particular value; median corresponds to the 50th percentile.

  • Quartile: Divides ordered dataset into four equal groups.

    • Q1 (first quartile): 25th percentile

    • Q2: Median

    • Q3 (third quartile): 75th percentile
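
  • A minimal Python sketch (assuming the numpy and statistics libraries) that computes these measures for the 20 marks used in the examples above:

    import numpy as np
    import statistics

    # The 20 marks from the mean/median/mode examples above
    marks = [90, 94, 53, 68, 79, 94, 53, 65, 87, 90,
             70, 69, 65, 89, 85, 53, 47, 61, 27, 80]

    print('mean =', np.mean(marks))            # 70.95
    print('median =', np.median(marks))        # 69.5
    print('mode =', statistics.mode(marks))    # 53
    print('Q1 =', np.percentile(marks, 25))    # 25th percentile
    print('Q3 =', np.percentile(marks, 75))    # 75th percentile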

Measures of Dispersion
  • Dispersion: Quantitative measure of the spread of a distribution.

  • Indicates whether values are situated around the central value or spread out.

    • Range: Difference between the lowest and highest values in a dataset.

    • Interquartile Range (IQR): Difference between the third quartile and the first quartile. IQR = Q3 - Q1

    • Variance: Measures how scattered values are around the mean. A small variance indicates that the mean is an appropriate measure of central tendency.

    • Standard Deviation: Square root of the variance, bringing it to the same units as the data.

    • Example: For data = {1, 2, 3, 4, 5, 6}

      • Range = 6 - 1 = 5

      • Mean = (1+2+3+4+5+6)/6 = 3.5

      • Variance = Σ(xᵢ – x̄)²/n = (1/6)(6.25 + 2.25 + 0.25 + 0.25 + 2.25 + 6.25) = 2.917

      • Standard deviation = √2.917 = 1.708
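
  • The same figures can be checked in Python; a minimal sketch using numpy (np.var and np.std divide by n, matching the population formulas used above):

    import numpy as np

    data = [1, 2, 3, 4, 5, 6]

    print('range =', max(data) - min(data))   # 5
    print('mean =', np.mean(data))            # 3.5
    print('variance =', np.var(data))         # ≈ 2.917
    print('std dev =', np.std(data))          # ≈ 1.708

    q1, q3 = np.percentile(data, [25, 75])
    print('IQR =', q3 - q1)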

Diagnostic Analytics

  • Examines data to answer "Why did it happen?"

  • Determines the causes of trends and relationships between variables.

  • Provides crucial information for data-driven decision-making.

  • Techniques:

    • Drill down

    • Data mining

    • Correlation analysis

    • Causal analysis

  • Requires intensive work and often human intervention.

  • Associated with root-cause analysis to understand the factors and events causing an event.

  • Generally follows the scientific method:

    1. Develop a hypothesis.

    2. Determine appropriate diagnostic analytics techniques.

    3. Collect and analyze data to test the hypothesis.

    4. Draw conclusions based on the analysis.

Drill Down Analysis
  • Explores data at increasingly granular levels to uncover insights and patterns.

  • Enables users to visualize lower levels of hierarchical data in one chart by nesting additional variables.

  • Typically interactive; clicking on a data point drills down to the next level of detail.

  • Example: Analyzing sales data by region, then by product category, and then by individual product.
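
  • A minimal pandas sketch of this drill-down idea; the sales DataFrame and its column names are purely illustrative:

    import pandas as pd

    # Hypothetical sales data (regions, categories, products, and revenue are made up)
    sales = pd.DataFrame({
        'Region':   ['North', 'North', 'South', 'South'],
        'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing'],
        'Product':  ['Laptop', 'Shirt', 'Phone', 'Jacket'],
        'Revenue':  [1200, 300, 800, 450],
    })

    # Top level: total revenue per region
    print(sales.groupby('Region')['Revenue'].sum())

    # Drill down: region -> product category -> individual product
    print(sales.groupby(['Region', 'Category', 'Product'])['Revenue'].sum())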

Data Mining
  • Includes gathering and preparation.

  • Followed by data analysis and interpretation.

Correlation Analysis
  • Measures the relationship between two or more variables.

  • Determines how changes in one variable are associated with changes in another.

  • Identifies the strength and direction of the relationship.

  • Statistical Rules:

    • Requires a large sample size for reliable analysis.

    • Assumes a linear relationship between variables (use regression for non-linear relationships).

    • Significance level (e.g., 0.05) indicates the probability of obtaining a correlation coefficient as extreme as the observed value, assuming no correlation.

    • Identify and deal with outliers as they can significantly impact the correlation coefficient.

  • Correlation Coefficient: Ranges from -1 to +1.

    • +1: Perfect positive correlation.

    • Close to +1: Strong positive correlation (e.g., +0.8).

    • Close to 0: No correlation.

    • Close to -1: Strong negative correlation (e.g., -0.8).

    • -1: Perfect negative correlation.

  • Correlation does not imply causation; further analysis is needed to establish causality.
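
  • A sketch of correlation analysis using scipy.stats.pearsonr, which returns both the coefficient and a p-value to compare against a significance level such as 0.05 (the data values are invented for illustration):

    import numpy as np
    from scipy.stats import pearsonr

    # Illustrative paired observations
    x = np.array([2, 4, 6, 8, 10, 12])
    y = np.array([1, 3, 7, 9, 11, 14])

    r, p_value = pearsonr(x, y)
    print('correlation coefficient =', r)   # close to +1: strong positive correlation
    print('p-value =', p_value)             # small p-value suggests the correlation is significant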

Causal Analysis
  • Aims to understand cause-and-effect relationships between variables.

  • Guiding Principles for Establishing Causality:

    • Association: Establish an association between the variables (e.g., using correlation coefficients).

    • Temporal precedence: The cause must precede the effect in time.

    • Non-spuriousness: The relationship should not be the result of a third variable influencing both cause and effect; address this using randomization or statistical control.

    • Consistency: The relationship should hold true across different contexts, populations, and time periods.

  • Establishing causality is challenging and requires statistical analysis, carefully formulated hypotheses, and domain expertise.

Types of Data

  • Understanding different data types is essential for selecting appropriate models and techniques.

  • Data is the raw material to be processed to extract information and gain a deeper understanding.

Categories of Data
  • Categorical (Qualitative):

    • Nominal: Named categories with no intrinsic order (e.g., gender, favorite ice cream flavor).

    • Ordinal: Categories with a predefined order (e.g., education level, customer satisfaction, economic status).

  • Numerical (Quantitative):

    • Discrete: Specific values that cannot be subdivided (e.g., number of students in a class, patients in a hospital).

    • Continuous: Can assume any numeric value and can be meaningfully split into smaller parts (e.g., weight, height, time).

  • Interval: Ordered values with meaningful, equal differences but no true zero (e.g., temperature in Celsius).

  • Ratio: Ordered values with meaningful differences and an absolute (true) zero (e.g., height).

Predictive Analytics

  • Makes predictions or estimations about the likelihood of future events based on historical data.

  • Provides insight into "What might happen?"

  • Techniques: data mining, statistics, modeling, Machine Learning (ML), and Artificial Intelligence (AI).

  • Machine Learning heavily focuses on predictive analytics, combining historical data from multiple sources to identify patterns.

  • Uses a statistical model or algorithm to capture relationships between datasets and predict the likelihood of an event.

Prescriptive Analytics

  • Measures the effect of a future decision to enable decision-makers to foresee possible outcomes.

  • Incorporates any combination of the other types of data analytics.

  • Combines business rules, domain knowledge, and underlying analysis techniques and tools.

  • Objective: Predict what will happen and explain why by predicting multiple futures based on different scenarios.

  • Allows companies to assess possible outcomes based on their actions.

  • Provides recommendations (prescriptions) supported by evidence and aligned with business case objectives.

  • Provides a big picture of why a problem happened, potential consequences if not addressed, and actionable insights for stakeholders.

Regression Analysis

  • Statistical technique to model the relationship between a dependent variable and one or more independent variables.

  • Types of Regression Models:

    • Linear Regression: Models the relationship with a straight line to predict continuous outcomes.

    • Logistic Regression: Used for binary or categorical dependent variables and classification tasks.

    • Polynomial Regression: Models non-linear relationships between variables using a polynomial equation.

  • Classification vs. Regression:

    • Classification: Predicting discrete class labels (e.g., email as spam or non-spam).

    • Regression: Predicting a continuous quantity (e.g., stock prices over time).

Linear Regression
  • Finds the relationship between dependent and independent variables.

  • Establishes a relationship between input and output by plotting the best-fit line.

  • Independent variable: input (X-axis).

  • Dependent variable: output (Y-axis).

  • Helps find the values of the missing variable.

Logistic Regression
  • Classification algorithm that predicts the probability of occurrence of the dependent variable given the independent variable(s).

  • Output is 0 or 1.

  • Applications: spam detection, customer choice prediction, cancer detection, etc.
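
  • A minimal sketch of logistic regression with scikit-learn; the hours-studied/pass-fail data is invented for illustration:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Toy data: hours studied (input) vs pass/fail outcome (0 = fail, 1 = pass)
    hours = np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(-1, 1)
    passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

    model = LogisticRegression()
    model.fit(hours, passed)

    print(model.predict([[4.5]]))          # predicted class (0 or 1)
    print(model.predict_proba([[4.5]]))    # probability of each class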

Comparison of Regression Types
  • Linear vs. Logistic Regression: Logistic regression is used when the output is a probability between 0 and 1.

  • Linear vs. Polynomial Regression: Polynomial regression is used when the relationship between variables isn't linear and is more complex (see the sketch below).
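
  • A minimal sketch of polynomial regression using numpy.polyfit, one simple way to fit a polynomial model (the roughly quadratic data is illustrative):

    import numpy as np

    x = np.array([1, 2, 3, 4, 5])
    y = np.array([1, 4, 9, 16, 25])

    # Fit a degree-2 polynomial and use it to predict a new value
    coeffs = np.polyfit(x, y, deg=2)
    poly = np.poly1d(coeffs)
    print(poly(6))   # approximately 36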

Bayes Theorem

  • Describes the probability of an event based on conditions related to the event, using conditional probability.

  • Theorem: P(A|B) = \frac{P(B|A)P(A)}{P(B)}

    • P(A|B): Probability of event A, given event B has occurred.

    • P(B|A): Probability of event B, given event A has occurred.

    • P(A): Probability of event A.

    • P(B): Probability of event B.

  • Applications: statistical modeling and inference, machine learning algorithms (e.g., Naive Bayes), medicine (determining medical test result accuracy).

  • Example: A card is drawn and is known to be a diamond; what is the probability that it is a queen?
    Given: the diamonds suit contains 13 cards, of which 1 is a queen.
    P(Queen|Diamond) = \frac{1}{13}

  • Calculation Example:

    • P(cloudy) = 0.40

    • P(rain) = 0.20

    • P(cloudy | rain) = 0.85

    • P(rain | cloudy) = \frac{P(rain) * P(cloudy | rain)}{P(cloudy)} = \frac{0.20 * 0.85}{0.40} = 0.425
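
  • The same calculation, reproduced as a minimal Python sketch:

    # Bayes theorem: P(rain | cloudy) = P(cloudy | rain) * P(rain) / P(cloudy)
    p_cloudy = 0.40
    p_rain = 0.20
    p_cloudy_given_rain = 0.85

    p_rain_given_cloudy = (p_cloudy_given_rain * p_rain) / p_cloudy
    print(p_rain_given_cloudy)   # 0.425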

Health Informatics

  • Uses information technology, communication, and healthcare knowledge to enhance patient outcomes.

  • Principles for Data Analysts in Health Informatics:

    1. Quality of the Data: Cleaning, ensuring integrity, and reporting biases or errors.

    2. Domain Knowledge: Understanding medical terminology, clinical processes, and regulations.

    3. Application of Techniques: Utilizing appropriate data analysis tools for diverse data types.

    4. Ethical Principles: Addressing consent, data anonymization, and bias in exploratory data analysis (EDA), visualizations, and recommendations.

    5. Visualization and Communication: Selecting appropriate visualizations to present results meaningfully and non-biasedly.

  • Example Case: Target’s Predictive Analytics (predicting pregnancies in customers for targeted advertising in 2012).

    • Data Source: Personal identifiable information (PII) from baby registries and customer purchase data.

    • Identifying Pregnancy Trends: Pregnant customers tended to buy large quantities of unscented lotion and supplements such as calcium, magnesium, and zinc.

    • Creating Pregnancy Prediction Score: Using purchase types and volume of ~25 products.

    • Insights: Target’s predictive analytics allowed them to anticipate customer needs and provide targeted advertising.

Statistics in Python

  • The statistics library can perform basic descriptive statistics.

  • Example:
    import statistics

    x = [1, 2, 3, 4, 5]
    m = statistics.mean(x)
    n = statistics.median(x)
    s = statistics.stdev(x)
    print('mean =', m, 'median =', n, 'standard deviation =', s)

Statistical Analysis with NumPy

  • NumPy provides a basis for several statistical techniques.

  • corrcoef() function returns a matrix of Pearson correlation coefficients.

  • Pearson correlation coefficient measures the linear association between variables (ranging from -1 to 1).

  • Example:
    import numpy as np

    x = np.array([1, 2, 3, 4, 5])
    y = np.array([2, 4, 5, 4, 5])
    correlation_matrix = np.corrcoef(x, y)
    print(correlation_matrix)

  • Linear Regression with scikit-learn

    import numpy as np
    from sklearn.linear_model import LinearRegression
    
    x = np.array([1, 2, 3, 4, 5]).reshape((-1, 1))
    y = np.array([2, 4, 6, 8, 10])
    
    model = LinearRegression()
    model.fit(x, y)
    
    x_new = np.array([6, 7, 8, 9, 10]).reshape((-1, 1))
    y_pred = model.predict(x_new)
    print(y_pred)
    

Data Preparation - Introduction

  • Phase 3 of the data analytics lifecycle is one of the most important (and time-consuming) activities.

  • Involves preparing data for analysis through data inspection and data cleaning.

Data Inspection

  • Ensures that all collected data meets business case needs and has enough reliability and consistency for data preparation.

  • Steps:

    1. Load the data: Access and import data into the analytics tool.

    2. View the data: Understand the structure, attributes, format, and quantity.

    3. Verify data quality: Check for data quality warnings and issues.

    4. Debugging: Troubleshoot and resolve identified issues or discrepancies.

    5. Documentation: Keep records and documentation of findings and observations.

  • Documentation is essential for future analysis, communicating insights, and data analytics reports.

  • Data inspection lays the foundation for the data preparation phase.
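
  • A minimal pandas sketch of the first three steps (the filename 'covid_data.csv' is a placeholder):

    import pandas as pd

    data = pd.read_csv('covid_data.csv')   # 1. Load the data

    print(data.head())         # 2. View the data: structure and first rows
    data.info()                # attributes, data types, and number of rows
    print(data.isna().sum())   # 3. Verify data quality: missing values per column
    print(data.describe())     # summary statistics help spot suspicious values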

Indexing

  • Indexing is fundamental to Pandas, making retrieval and access to data much faster.

  • Two types of indexes:

    • Row index (vertical) with labels attached to rows.

    • Column index with labels (column names) for every column.

  • dataframe.set_index('Column Name') - sets an index using an existing column.

  • reset_index() - resets to the original index.

  • Setting an index improves the speed at which we can retrieve data.

  • The index object is immutable.
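
  • A minimal sketch of setting and resetting an index in pandas (the DataFrame contents are illustrative):

    import pandas as pd

    df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

    df = df.set_index('Name')   # use an existing column as the row index
    print(df.loc['Alice'])      # fast label-based retrieval via the index

    df = df.reset_index()       # restore the default integer index
    print(df)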

Data Integration

  • Data may come from a combination of sources (e.g., Excel spreadsheets, CSV files, JSON files).

  • Data may also have different formats.

  • Common data integration techniques:

    • Joining and Merging: Combining data from different DataFrames based on a common key column or index.

    • Concatenating: Appending data along a specific axis (rows or columns).

    • Combining: Combining DataFrames while handling overlapping data and filling missing values.

  • Data integration is essential to create a unified dataset that can be easily analyzed and processed.

Joining and Merging
  • join() combines two DataFrames that have the same index or overlapping columns and performs a left join by default.

  • merge() combines two DataFrames based on one or more common columns.

    • Allows specifying columns to join on, the type of join to perform (inner, outer, left, or right), and how to handle any missing values.

  • Example:
    import pandas as pd

    # Creating a dictionary and building a DataFrame from it
    d = {'id': [1, 2, 10, 12], 'val1': ['a', 'b', 'c', 'd']}
    a = pd.DataFrame(d)
    a

  • The merge() function connects the rows in DataFrames based on one or more keys.
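
  • A sketch of merge() continuing the example above; the second DataFrame b and its values are assumed for illustration:

    import pandas as pd

    a = pd.DataFrame({'id': [1, 2, 10, 12], 'val1': ['a', 'b', 'c', 'd']})
    b = pd.DataFrame({'id': [1, 2, 9, 8], 'val2': ['p', 'q', 'r', 's']})   # assumed second DataFrame

    # An inner join keeps only ids present in both DataFrames (1 and 2)
    merged = pd.merge(a, b, on='id', how='inner')
    print(merged)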

Concatenating
  • Combines two or more pandas objects (Series or DataFrames) along a particular axis.

  • Vertically or horizontally stacks the objects to create a new object with the combined data.

  • NumPy uses the concatenate() function for arrays, e.g., np.concatenate([array1, array2], axis=1).

  • Pandas uses the concat() function for objects.

    • pd.concat(objs, axis=0, join='outer', ignore_index=False)

      • objs: Sequence or mapping of pandas objects to concatenate.

      • axis: Specifies the axis along which the concatenation should occur (0 for vertical, 1 for horizontal).

      • join: Determines how the objects should be joined ('outer' for union, 'inner' for intersection).

      • ignore_index: Determines whether to ignore the original indexes (True to assign new integer indexes).
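
  • A minimal sketch of concat() with these parameters (the Series contents are illustrative):

    import pandas as pd

    s1 = pd.Series(['a', 'b'])
    s2 = pd.Series(['c', 'd'])

    # Stack vertically (axis=0) and assign a fresh integer index
    print(pd.concat([s1, s2], axis=0, ignore_index=True))

    # Stack horizontally (axis=1): each Series becomes a column
    print(pd.concat([s1, s2], axis=1))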

Combining
  • The pandas.DataFrame.combine_first() function combines two DataFrames with overlapping data, filling missing values in one with values from the other.

  • Example:

import pandas as pd

# df1 has gaps (missing values); df2 holds overlapping data for the same positions
df1 = pd.DataFrame({'Name': ['Alice', None], 'Age': [25, None]})
df2 = pd.DataFrame({'Name': ['Alicia', 'Bob'], 'Age': [26, 30]})

# Missing values in df1 are filled with the corresponding values from df2
combined_df = df1.combine_first(df2)
display(combined_df)

Data Transformation

  • Involves cleaning, preprocessing, and transforming data before modeling or analysis.

  • Includes dealing with missing values, converting data types, and scaling or normalizing data.

  • Feature engineering: Creating new features from existing data.

  • Dimensionality reduction: Reducing the number of features.

  • Encoding categorical variables: Converting categorical variables into numerical values.

  • Modifying a DataFrame in Pandas is essential to enhance readability and select important elements for analysis.

  • Example using a Covid dataset (converting data from one format to another).

    • Initial Covid Dataset: Load CSV of Covid data

  • Steps:
    1. Load the dataset, setting the dateRep column as the index:
    data = pd.read_csv(filename, index_col='dateRep')
    2. Print the top 5 rows:
    data.head()

Modifying Columns - Headings

  • Column names can be changed using the rename() method:

    • df.rename(columns={'old name': 'new name'}, inplace=True)
      Example:

import pandas as pd
df = pd.read_excel('Raj.xlsx')
df

df.rename(columns={
    'First Name': 'FN',
    'Last Name': 'LN',
    'Gender':  'Sex',
    'Country': 'Nation',
    },
    inplace=True
)

df

Value Counts

  • value_counts() method counts occurrences of a specific value in a column (EDA tool).

  • Example:
    data['Continent'].value_counts()
    Can also specify a value in a column:

data['Countries/Territories'].value_counts()['Australia']

value_counts() on string data will treat different spellings or capitalizations of words as different values.
Use str.lower() if different spellings appear.
Whitespace can be handled by methods such as str.strip().

The unique() function

  • Returns each unique value in the specified column.
    Example:

data['Countries/Territories'].unique()

Other str() functions

  • str.lower()

  • str.upper()

  • str.strip()
    Can be used to format strings
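
  • A minimal sketch of how these string methods clean up inconsistent values before counting (the values are illustrative):

    import pandas as pd

    s = pd.Series([' Australia', 'australia', 'AUSTRALIA '])

    # Normalise case and strip whitespace so all variants count as one value
    cleaned = s.str.strip().str.lower()
    print(cleaned.value_counts())   # australia appears 3 times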

Changing Data Types

  • Data types of columns in a DataFrame can be changed using astype() method:

    • df['column name'] = df['column name'].astype('new data type')

  • The method data.dtypes.value_counts() will return a count of the columns that belong to each data type.
    We can also use the select_dtypes() method to find columns based on a specific data type.

  • Find all int dtypes with:
    data.select_dtypes(include='int')
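
  • A minimal sketch combining astype(), dtypes.value_counts(), and select_dtypes() (the DataFrame is illustrative):

    import pandas as pd

    df = pd.DataFrame({'cases': ['1', '2', '3'], 'rate': [0.1, 0.2, 0.3]})

    # Convert the 'cases' column from strings to integers
    df['cases'] = df['cases'].astype('int')

    print(df.dtypes.value_counts())          # count of columns per data type
    print(df.select_dtypes(include='int'))   # columns whose dtype is integer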

Dropping Columns

Drop multiple columns using:
data.drop(['Continent', 'geoId', 'Country Code'], axis='columns')

data.columns returns the index of column labels.
  • Can be used to find the last column and index its value.

data.drop(data.columns[-2], axis='columns')   # drops the second-to-last column
data.drop(data.columns[:-5], axis='columns')  # drops every column except the last five