Data Analytics: Project Workflow and Essential Excel Skills
Understanding the Analytics Project Process
Structured Attack on Data: The process of data analysis is a structured attack, designed to be realistic and applicable to real-world market scenarios.
Overall Steps:
Data Collection & Getting Data Together: This involves gathering data, whether it's primary data (collected directly) or secondary data (pre-existing). The initial step is to consolidate all necessary data.
Data Exploration (Summary Statistics): Essential for analysts to understand the data they are about to analyze. This stage involves calculating various descriptive statistics to get a 'feel' for the data. Key questions asked include: Is it skewed? What are the mean, median, max, min, and standard deviation? Are there relevant subgroups?
Prediction Modeling (Advanced Models): This step involves using the prepared data to create predictions or build models. This can range from simpler techniques to complex machine learning applications.
Examples of Advanced Models: Regression analysis, logistic regression, machine learning, deep learning, classification models.
Communication of Analysis (Storytelling & Visualization): The final, crucial step is to effectively communicate the analytical findings to decision-makers. This involves storytelling and leveraging data visualization skills.
Importance of Visualization: Visuals (infographics, charts) are often more effective and easier for humans to comprehend than large pages of text, aiding in summarizing and communicating analysis powerfully.
Interconnectedness: All five steps (implied, as this section focuses on the last one) interact and are often performed together, or in stages, sometimes requiring data collection from external sources (e.g., websites).
Summary Statistics
Definition: Summary statistics are measures used to summarize data, especially large datasets. Instead of reviewing hundreds or millions of observations, a few key statistics can convey the essence of the data (e.g., the average of 100 values).
Measures of Central Tendency: These measures indicate the center or typical value of a distribution, showing where the largest mass of observations lies. In this course, the focus will be on these three:
Mean (Average): The sum of all values divided by the number of values.
Median: The middle value in a dataset when values are ordered from lowest to highest. It's less affected by extreme outliers.
Mode: The value that appears most frequently in a dataset.
Measures of Dispersion/Variation: These measures describe how spread out the data points are. High variation can make prediction difficult.
Variance: The average of the squared differences from the mean, indicating the spread of data points from the mean.
Standard Deviation: The square root of the variance, providing a measure of spread in the same units as the data.
Maximum (Max) and Minimum (Min): The highest and lowest values in a dataset, providing the range of the data.
Interquartile Range (IQR), Percentiles, Quartiles: These are other measures of dispersion that focus on specific segments of the data distribution, though not the primary focus of this initial discussion.
Caveats of Using Averages (The "Law of Averages" Flaw):
Averages can be misleading if there are extreme outliers or heavily skewed data. For example, a person might drown in a river that is "on average" only waist-deep if there's a deep trench (an outlier).
Example: US Household Income Distribution: The average income might be high, but the distribution is often skewed right, with most of the population having lower incomes and a small top 1% having a disproportionately large share of wealth. Reporting only the mean would not accurately represent the majority.
Example: Client Service Time: A company might meet its average service time target (e.g., 21 minutes vs. target of 30 minutes), but if 25% of customers are still waiting significantly longer, the average hides these unhappy outliers.
Outliers' Impact: Averages are extremely sensitive to outliers. It's often advisable to use the median, which is less affected by extreme values, especially with skewed data. In cases with many outliers, reporting the mean alone can be deceptive.
Higher Variation and Outliers: A larger number of outliers or extreme values typically leads to higher data dispersion or standard deviation, suggesting a wider range of values.
Practical Application: Beatrix Manager Scheduling Example:
To determine staffing needs (e.g., how many people to work at certain times), a manager would analyze average customer footfall at different times (central tendency).
They would also look at the standard deviation (variation) to understand how consistently busy those times are. For example, Sunday footfall might be more variable than a weekday lunchtime rush, requiring different scheduling strategies.
Data Distribution Visualization with Histograms
Purpose: While summary statistics condense data into single points, histograms visually represent the entire distribution of a single variable, showing the complete picture of the data.
Construction: Histograms are bar graphs that create different "buckets" (bins) of data and show how many observations fall into each bucket.
Predefined Distributions: Statistics includes predefined distributions like the normal distribution and the t-distribution (which are related). Data can often be categorized as following one of these distributions.
Visualizing Skewed Data: Histograms clearly show if data is skewed. A left-skewed (negatively skewed) distribution has more observations on the right tail, and a right-skewed (positively skewed) distribution has more observations on the left tail.
Revealing Subgroups and Variation: Histograms can reveal hidden structures in data.
Example: Salary Distribution: A bimodal distribution (two peaks) in a salary histogram might indicate two distinct groups within the population, e.g., younger, less experienced people with lower, less varied salaries, and older, more experienced people with higher, more varied salaries.
Comparing two histograms can quickly highlight differences in central tendency and dispersion (e.g., one group's salaries ranging from \10\k vs. another from \50\k).
Univariate Analysis using Summary Statistics: When analyzing a single variable (one column of data), such as US household income or customer wait times:
Central Tendency Measures: Mean, median, mode.
Variation Measures: Maximum, minimum, standard deviation, variance.
Visual Measure: Histogram.
Relationships Between Variables: Correlation
Definition: Correlation describes the linear relationship between two variables. It indicates whether two variables tend to move together.
Linearity is Key: Correlation is only a good measure for linear relationships. It's not suitable for nonlinear relationships.
Directional Relationship Only: Correlation only tells you that two variables move together; it does not indicate causation (one variable causing the other).
Covariance: A similar measure to correlation, but correlation is generally more interpretable due to its standardized scale.
Properties of Correlation (Pearson Correlation Coefficient):
Range: Always between and .
Zero Correlation (): The two variables are unrelated (no linear relationship).
Negative Correlation (e.g., to ): Variables move in opposite directions. For example, if one increases, the other decreases.
Positive Correlation (e.g., to ): Variables move in the same direction. If one increases, the other tends to increase.
Strength of Relationship: The magnitude of the correlation coefficient indicates strength:
Strong Correlation: Generally considered or .
Weak Correlation: Typically or .
Correlation Visualization: Scatter Plots
Purpose: Scatter plots are the visual representation for studying the correlation between two variables, just as histograms are for single variables.
Construction: One variable (e.g., ) is plotted on the y-axis, and another variable (e.g., ) is plotted on the x-axis. Each point on the plot represents a pair of values from the dataset.
Interpretation of Patterns:
Positive Relationship (Upward Sloping): If the points generally trend upwards from left to right, it suggests a positive correlation. As one variable increases, the other tends to increase.
No Relationship (Scattered Points): If points are widely scattered without a discernible pattern, it indicates a weak or no correlation (e.g., or ).
Negative Relationship (Downward Sloping): If the points generally trend downwards from left to right, it suggests a negative correlation. As one variable increases, the other tends to decrease.
Example: Shell vs. Ford Stock Prices:
Plotting Shell stock prices (y-axis) against Ford stock prices (x-axis) showed a correlation of .
This is a weak positive correlation (0.3 < 0.7). The stocks tend to move in the same direction, but not strongly in close alignment.
In finance, understanding positive or negative correlation between assets is crucial for trading strategies.
Correlation Does Not Mean Causation
Critical Distinction: This is a fundamental principle in data analysis. Just because two variables are correlated does not mean one causes the other.
Example: Shell vs. Ford Stocks: A correlation of between Shell and Ford stock prices does not mean an increase in Ford prices causes an increase in Shell prices (or vice-versa). They simply move somewhat together, influenced by other market factors.
Spurious Correlations: Many examples exist where variables show high correlation but have no causal link.
Example 1: Space Launches and Sociology Doctorates: Worldwide non-commercial space launches and the number of sociology doctorates awarded in the US might appear correlated, but it's highly improbable that one causes the other. Their co-movement is purely coincidental ("just happens").
Example 2: Milk Consumption and Electricity Usage: Historical data might show a high correlation between milk consumption and electricity usage in a city. While they move together, it's not a causal link, and building a prediction model solely on this correlation would be flawed.
Confounding Variables: Often, an observed correlation between two variables is actually due to a third, unobserved, confounding variable that influences both.
Example: Lighters and Lung Cancer: People who carry lighters tend to get lung cancer. Carrying a lighter does not cause lung cancer. The confounding variable is smoking. People who carry lighters are more likely to smoke, and smoking causes lung cancer.
Implications: When analyzing correlation, analysts should only state movement or relationship strength. Making causal statements requires different (causal or prescriptive) modeling techniques beyond simple correlation.
Excel Basics: Concepts and Practical Application
Introduction to Excel for Data Exploration: Excel is a fundamental tool for implementing and understanding data analysis concepts firsthand.
Key Excel Terminology:
Workbook: The entire Excel file (e.g.,
module3.xlsx).Worksheet: Individual tabs within a workbook (e.g., "Getting Started", "Data Description", "Fictional Data"). One workbook can contain multiple worksheets.
Ribbon: The entire top bar where all menu options and commands are located (e.g., Home, Insert, Page Layout).
Tabs: Main categories on the ribbon (e.g., Home, File, Insert).
Groups: Sub-sections within a tab that organize related commands (e.g., 'Clipboard', 'Font', 'Number' within the Home tab).
Title Bar: The very top bar of the Excel window, displaying the file name.
Scroll Bars: Located on the right and bottom, allowing movement through rows and columns not currently visible.
Status Bar: Located at the bottom of the Excel window, providing quick summary statistics (average, count, sum) of selected cells.
Quick Access Toolbar: Customizable toolbar at the very top, to the left of the title bar, for frequently used commands.
Formula Bar: Displays the content of the active cell. This is where formulas (always starting with
=) are entered and edited (e.g.,=AVERAGE(range),=STDEV(range)).Cell Reference: Identifies a specific cell by its column letter and row number (e.g.,
F4).Range: A selection of multiple cells (e.g.,
A1:G100).Active Cell: The currently selected and highlighted cell where data can be entered or modified.
Crucial File Saving Practice: Always save Excel files in the
.xlsxformat to ensure all worksheets and work are preserved. Saving in other formats (e.g.,.csv) might only save one worksheet.Practical Excel Tasks (Using a fictional movie dataset for "Cinema"):
Data Description: Understanding the variables in a dataset is the first step. The example movie dataset includes
Title,Year,Genre,Audience_Ratings,Box_Office_Revenue_Millions,Studio_Houses,Movie_Length_Minutes, andProduction_Budget_Millions.Moving Columns: Columns can be moved by selecting the column, holding the
Shiftkey, hovering over the column border until a four-directional arrow appears, and then dragging and dropping to the desired location. Alternatively, cut-and-paste can be used.Formatting Values:
Currency: Select the column (e.g.,
Box_Office_Revenue_Millions), go to the 'Number' group on the Home tab, and select 'Currency' from the dropdown menu to apply currency symbols (e.g., \) and decimal places.Decimals: Use the 'Increase Decimal' or 'Decrease Decimal' buttons within the 'Number' group to adjust the number of decimal places for selected cells. This improves readability.
Scientific Notation: Sometimes numbers appear in scientific notation (e.g., ). To change this, select the cells and choose 'General' or 'Number' format from the number dropdown in the Home tab.
Sorting Data:
Sorting requires caution to avoid breaking the dataset (e.g., mismatching revenues with their original movie titles).
Best Practice: Select the entire dataset (by clicking the top-left corner box between column
Aand row1), then go to 'Sort & Filter' in the 'Editing' group on the Home tab, and choose 'Custom Sort'. This allows sorting by specific columns while keeping rows intact.The example sorted
Box_Office_Revenue_Millionsfrom smallest to largest, which correctly rearranged all associated data.
Filtering Data:
Useful for analyzing specific subgroups within the data.
How to Apply: Select the header row (the row containing column names), then go to 'Sort & Filter' in the 'Editing' group and click 'Filter'. This adds dropdown arrows to each column header.
Filtering by Value: Click the dropdown arrow on the desired column (e.g.,
Studio_Houses), uncheck 'Select All', then select the specific value(s) to filter by (e.g., 'Galaxy Films'). This will display only the rows matching the filter criteria.
Continuous Learning: Many Excel functions and shortcuts exist; active practice and exploration are key to becoming proficient.