EDA

When we first get a dataset, it seems inherently obvious that we should look through the data to see what answers lie within.

Usually, when we must or want to explore data, most would agree that we also have the ability and knowledge, at least at a basic level, to answer the questions we may have about it.

However, there is often a missed step in this kind of “inherent” understanding of looking through data. It can work, but things are often overlooked, not fully understood, or misinterpreted because of biases in our perspective, incomplete knowledge of the dataset, or a lack of proper analytical techniques. What is less obvious is that different types of data call for different methods of interpretation and different techniques for uncovering the truth or patterns within them.

This is where EDA (Exploratory Data Analysis) comes in handy. By systematically summarizing a dataset's main characteristics, using a combination of graphical and non-graphical methods, EDA helps uncover patterns, spot anomalies, and test hypotheses, enhancing our understanding before we dive into more intricate analyses.

When we explore data, we aim to uncover patterns, identify anomalies, and gain insights that can inform our decision-making. EDA helps with this because it is the first step toward any machine learning model we may aspire to build, ensuring that we have a solid grasp of the underlying structure and relationships within the data. By conducting EDA, we can assess the quality of our data, identify potential outliers, and understand the distributions of different features, which ultimately supports better model performance and more accurate predictions.
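A minimal sketch of what such a first pass might look like, using only Python's standard library. The `ages` column is made up for illustration, and the 1.5 × IQR rule for flagging outliers is one common convention assumed here, not a method prescribed by these notes.

```python
import statistics

# Hypothetical feature column (not real data); None marks a missing value.
ages = [23, 25, 31, None, 29, 27, 24, 95, 30, None, 26, 28]

def first_pass_summary(values):
    """Summarize data quality, distribution, and potential outliers."""
    present = [v for v in values if v is not None]
    # Quartiles split the sorted values into four equal-sized groups.
    q1, median, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    # Assumed convention: values beyond 1.5 * IQR from the quartiles are outliers.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "n_missing": len(values) - len(present),
        "min": min(present),
        "median": median,
        "max": max(present),
        "mean": statistics.mean(present),
        "outliers": [v for v in present if v < low or v > high],
    }

summary = first_pass_summary(ages)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Even this small summary covers the three checks mentioned above: data quality (missing values), potential outliers, and the rough shape of the distribution.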

EDA is very much like painting a picture of the story the data will tell. But remember: a picture is worth 1,000 words, and a small change in a facial expression can change the words the picture speaks.

Statistics

In order to properly understand EDA, we must first understand the importance of statistics and the differences that lie therein.

  • Descriptive Stats: Reporting facts (e.g. “The average score is 75.”).

  • Inferential Stats: Drawing conclusions from a data sample and making predictions about a population based on that sample (e.g. "Based on our sample, we can predict that 60% of the population will get around a score of 75.").

What is Statistics?

Have you ever had a question in your head? Ah, I’m sure you have. How would you normally go about finding the answer? At what point in the discovery process are you satisfied with that answer? I guess each person has their own ‘breaking point’ of satisfaction (outliers being nihilists and existentialists).

Some questions are easier to answer than others.

Asking “Is this cup red?” is very different from asking “How many rhinos in Africa will give birth to male calves this year?” Obviously, starting from zero on either question is difficult, but luckily, for questions that call for a probabilistic answer, we have a tool at our disposal called “statistics”.

Statistics can’t answer every question for us, but there is a vein of questioning that it is effective with:

  • Why do people eat fast food?

  • Is there a correlation between how much fast food a person eats and how early they die?

  • Does studying 2 hours a day actually yield better employment results, whether faster or higher paying, than studying fewer hours per day or none at all?
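Questions about correlation, like the ones above, can be put into numbers. Below is a small sketch of the Pearson correlation coefficient. The function name `pearson_r` and both data lists are hypothetical and invented for illustration, so the result says nothing about real study habits.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    Note: a constant sequence has zero variance and raises ZeroDivisionError.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: study hours per day and months until a job offer.
study_hours = [0, 0.5, 1, 1.5, 2, 2.5, 3]
months_to_offer = [9, 8, 8, 6, 5, 5, 4]

r = pearson_r(study_hours, months_to_offer)
print(f"r = {r:.2f}")  # a value near -1 means more hours go with fewer months
```

A strong correlation only says that the two quantities move together in this sample. It does not say that one causes the other.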

Two Meanings of Statistics

  1. The field of statistics: the study and practice of collecting and analyzing data.

  2. Facts about or summaries of data.

There are two different types of statistics:

  • Descriptive: The simpler of the two: looking at the raw data and understanding summaries of it.

    • Mean, Median, Mode, Max, Central Tendency, Spread, Distribution, etc…

    • The average test score for the test is a 75.

    • Describes what the data shows.

    • You want to drink less coffee but still feel as awake. How can you figure out how to do that?

  • Inferential: Taking our findings from descriptive statistics and using them to make assumptions, hypotheses, and predictions about the wider population.

    • In the coming year, 60% of people should achieve somewhere around a 75 on the test.
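To make the distinction concrete, here is a rough sketch using Python's `statistics` module. The `scores` list is invented so that the average comes out to 75, and the interval relies on a normal approximation, which is an assumption that is only rough for a sample this small.

```python
import math
import statistics

# Hypothetical sample of test scores (not real data).
scores = [62, 70, 71, 74, 75, 75, 76, 78, 80, 89]

# Descriptive: report facts about the sample itself.
mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)
spread = statistics.stdev(scores)  # sample standard deviation

# Inferential: estimate the population mean, with stated uncertainty.
# Assumes a random sample and a normal approximation (rough for n = 10).
z = statistics.NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval
margin = z * spread / math.sqrt(len(scores))
interval = (mean - margin, mean + margin)

print(f"mean={mean}, median={median}, mode={mode}, stdev={spread:.2f}")
print(f"approximate 95% interval for the population mean: "
      f"{interval[0]:.1f} to {interval[1]:.1f}")
```

The first print line only describes the ten scores we have. The second makes a claim about scores we have not seen, which is why it comes as a range rather than a single number.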

There will always be some degree of uncertainty when it comes to statistics and answering questions.

Statistics is like a superhero whose bat-signal is uncertainty and whose tagline is "When you don’t know for sure, but doing nothing isn’t an option."

Statistics as a Tool 🧰

  • Statistics help us make sense of vast amounts of information.

  • They filter data, similar to how our eyes and ears filter stimuli.

    • Descriptive statistics make data more digestible.

    • Inferential statistics help us make decisions about data when there’s uncertainty.

  • Statistics help us reason, but don't reason for us.

  • Statistics, like chainsaws, are useless or dangerous without understanding how they work.

Statistics Done Poorly 🤕

  • Statistics done poorly can lead to silly conclusions.

  • Chainsawing done poorly leads to injuries.

    • Example: 36,000 chainsaw injuries per year in the US

      • 81% are lacerations

      • 95% of those hurt are male

Applications of Statistics

Statistics can help with many decisions:

  • Planning a vacation to Bali in December

  • Optimizing your fantasy football league chances

  • Budgeting your meal card in college

  • Deciding if the extra insurance on a blender is worth it

  • Deciding whether or not to have heart surgery

  • NGOs optimizing food aid to refugee camps

  • Policymakers deciding on student loan spending

  • Deciding how much to borrow for college

Limitations of Statistics

Thinking statistically means knowing the difference between what statistics can and cannot do.

  • Example: Statistics can tell you if your mom gives your brother more ice cream, but not whether she loves him more.

    • She might just be giving you extra sprinkles.

    • It’s important to note that we may have a question we want answered that statistics can’t help with, such as whether how much people pray affects the weather in their location that year. (Magic, obviously.) For questions like this, we can rephrase the question or break it into smaller, more answerable questions that statistics can help with. For instance, instead of asking whether prayer affects weather, we could explore how weather patterns change over time, or examine the frequency of prayer in relation to significant weather events.

Structured Data

  • Working with structured data is essential for understanding trends and making informed decisions based on statistical evidence.

  • Identifying the data types we work with is important because the data type helps determine the type of visual display, data analysis, or statistical model we use to answer questions and explore our data.

  • Numerical: data expressed numerically

    • Continuous: data that can take on any value in an interval, such as wind speed or time duration.

    • Discrete: data that can take on only integer values, such as counts (for example, the number of requests or occurrences in a specified period).

  • Categorical: data that can take on only a specific set of values representing a set of possible categories. (AKA: enums, enumerated, factors, nominal)

    • Examples of a fixed set of values:

      • Types of TV screens (plasma, LED, LCD)

      • Genders

    • Binary: a special case of categorical data with just two categories of values, such as 0 or 1, yes or no.

    • Ordinal: categorical data that has an explicit ordering.
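The data types above can be sketched in code. The helper `infer_kind`, the `Size` enum, and the sample columns below are all hypothetical, and the heuristic is deliberately crude: for instance, it would mislabel integer category codes as discrete counts.

```python
from enum import IntEnum

def infer_kind(values):
    """Guess the data type of a column from its values (a rough heuristic)."""
    if len(set(values)) <= 2:
        return "binary"
    if all(isinstance(v, int) for v in values):
        return "discrete"
    if all(isinstance(v, (int, float)) for v in values):
        return "continuous"
    return "categorical"

# Hypothetical columns, one per data type discussed above.
columns = {
    "wind_speed": [3.2, 7.8, 5.1, 6.0],
    "requests": [12, 0, 7, 3],
    "screen": ["plasma", "LED", "LCD", "LED"],
    "competitive": [0, 1, 0, 0],
}
kinds = {name: infer_kind(values) for name, values in columns.items()}
print(kinds)

# Ordinal data needs its ordering stated explicitly; it cannot be
# inferred from the labels alone. IntEnum is one way to record it.
class Size(IntEnum):
    SMALL = 1
    MEDIUM = 2
    LARGE = 3

ordered = sorted([Size.LARGE, Size.SMALL, Size.MEDIUM])
print([size.name for size in ordered])
```

Knowing the kind of each column is what lets us pick a sensible next step, such as a histogram for continuous values or a frequency table for categories.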

Unstructured Data

  • Images are a collection of pixels, where each pixel contains RGB color information.

  • Texts are sequences of words and nonword characters, often organized by sections, subsections, and more.

  • Clickstreams are sequences of actions by a user interacting with an app or a web page.

  • The majority of data available is unstructured. This is the real call to arms for the future of data science: figuring out how to turn these massive amounts of data into actionable information.
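As a small illustration of turning unstructured data into something structured, the sketch below represents a tiny image as a grid of RGB tuples and reduces a piece of text to word counts. The image, the sample sentence, and the simple tokenizing pattern are all assumptions made for the example.

```python
import re
from collections import Counter

# A hypothetical 2 x 2 image: rows of pixels, each pixel an (R, G, B) tuple.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]
red_channel = [[pixel[0] for pixel in row] for row in image]

# Hypothetical text, split into lowercase word tokens with a simple pattern.
text = "The majority of data is unstructured. Unstructured data is everywhere."
tokens = re.findall(r"[a-z']+", text.lower())

# Word counts are a structured (tabular) view of the unstructured text.
counts = Counter(tokens)
print(red_channel)
print(counts.most_common(3))
```

Once text has been reduced to counts like these, the descriptive tools from earlier sections apply to it just as they do to any other table.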

Rectangular Data

  • Spreadsheet or database table

  • A general term for a two-dimensional matrix with rows indicating records (cases) and columns indicating features (variables)

    • Outcomes: Often a yes / no outcome (the auction was competitive)

      • AKA: Dependent variables, responses, target, output

    • Records: AKA: Case, Example, Instance, Observation, Pattern, Sample

    • Features: AKA: attributes, inputs, predictors, variables

  • Data in relational databases must be extracted and put into a single table for most data analysis and modeling tasks.

Category          Currency  SellerRating  Duration  EndDay  Competitive
Music/Movie/Game  US        3249          5         Mon     0
Music/Movie/Game  US        3249          5         Mon     0
Automotive        US        3115          7         Tues    1
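The sample records above can be loaded into rectangular form with Python's `csv` module. This is a sketch: the records are typed in as CSV text rather than read from a real file, and treating `Competitive` as the outcome follows the note above that the outcome is whether the auction was competitive.

```python
import csv
import io

# The three sample records shown above, written out as CSV text.
raw = """Category,Currency,SellerRating,Duration,EndDay,Competitive
Music/Movie/Game,US,3249,5,Mon,0
Music/Movie/Game,US,3249,5,Mon,0
Automotive,US,3115,7,Tues,1
"""

# Each row becomes one record: a dict mapping column name to value.
records = list(csv.DictReader(io.StringIO(raw)))

# Split the columns into features (inputs) and the outcome (target).
outcome_name = "Competitive"
feature_names = [name for name in records[0] if name != outcome_name]

features = [{name: row[name] for name in feature_names} for row in records]
outcomes = [int(row[outcome_name]) for row in records]

print(f"{len(records)} records, {len(feature_names)} features")
print("outcomes:", outcomes)
```

Note that `csv` reads every value as a string, so numeric columns such as `SellerRating` would still need converting before analysis.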

Non-rectangular Data Structures

  • Time Series: records successive measurements of the same variables.

  • Spatial Data: represents information about objects in a space, capturing their shape, location, and connections.

  • Graph Data: consists of nodes (vertices) and edges (connections), ideal for representing relationships in networks.

  • Text Data: encompasses unstructured data such as documents, emails, and social media posts, where context and meaning can be extracted through natural language processing techniques.
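Two of these structures are easy to sketch in plain Python: a time series as time-ordered pairs, and a graph as an adjacency list. The temperatures and the `follows` network below are made-up values for illustration only.

```python
from datetime import date

# Time series: successive measurements of the same variable, ordered by time.
temperatures = [
    (date(2024, 1, 1), 21.5),
    (date(2024, 1, 2), 22.1),
    (date(2024, 1, 3), 20.9),
]
# Order matters here: each change compares a reading with the one before it.
daily_change = [
    round(later[1] - earlier[1], 1)
    for earlier, later in zip(temperatures, temperatures[1:])
]

# Graph: nodes and directed edges stored as an adjacency list.
follows = {
    "ana": ["ben", "cal"],
    "ben": ["cal"],
    "cal": [],
}
# Count incoming edges per node (how many others point to it).
in_degree = {node: 0 for node in follows}
for targets in follows.values():
    for target in targets:
        in_degree[target] += 1

print(daily_change)
print(in_degree)
```

Neither structure fits neatly into a single table of independent rows, which is why each needs its own methods.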