Machine Learning
The use and development of computer systems that are able to learn and adapt without following explicit instructions, by using algorithms and statistical models to analyze and draw inferences from patterns in data.
Types of Machine Learning
Supervised learning
Unsupervised learning
Reinforcement learning
Semi-supervised learning
Supervised learning
Uses labeled data to train a model to make predictions.
A model is trained on a labeled dataset until it can accurately predict new, unseen data.
The algorithm is given labeled examples with the correct answers and adjusts until it can predict them reliably.
Imagine a teacher is supervising the learning process of a student
The labeled training dataset: The teacher
The machine: The student
Learning is done repeatedly/iteratively
Areas of specialization
Image and speech recognition
Fraud detection
Recommendation system
Medical diagnostics
Weather forecasting
Algorithms
K-nearest neighbors (KNN)
Logistic Regression
Decision Trees
Support Vector Machine (SVM)
Random Forest
Unsupervised learning
A type of machine learning that works with unlabeled data.
Tries to learn the pattern and structure of data on its own.
Areas of specialization
Clustering similar data items together
Types of clustering
Partitioning
Hierarchical
Density-based methods
Finding meaningful groups within a given dataset
Large dataset, where it would be time-consuming and expensive if done manually
Identifying hidden relationships that may not be immediately obvious
Segment customers
Creation of new patterns
Algorithms
K-means clustering
Partitions the data set into a number of clusters (k) that is chosen in advance.
Assigns each data point to the cluster whose mean (centroid) is nearest.
Recomputes each centroid as the mean of the points assigned to it.
Repeats until the cluster assignments no longer change.
Advantages
Computationally efficient
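The assign-and-update loop of K-means can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function name k_means, the toy points, and the choice to use the first k points as starting centroids are all assumptions made for the example (libraries usually pick starting centroids with a randomized scheme instead).

```python
import numpy as np

def k_means(points, k, n_iters=100):
    """Cluster points (n_samples x n_features) into k groups."""
    # Simplification for this sketch: the first k points are the initial centroids.
    centroids = points[:k].copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iters):
        # Assignment step: distance from every point to every centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two well-separated groups of 2-D points, interleaved.
points = np.array([[1.0, 1.0], [8.0, 8.0], [1.2, 0.8],
                   [8.2, 7.9], [0.8, 1.1], [7.9, 8.3]])
labels, centroids = k_means(points, k=2)
print(labels)     # cluster index of each point
print(centroids)  # final cluster means
```

Note that k is passed in by the caller; the algorithm itself does not decide how many clusters to use.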
Reinforcement learning
The agent learns by interacting with an environment.
The agent performs certain actions and then observes the rewards or consequences.
Learns from mistakes.
Trial and error
The difference here is there is no correct answer to mimic.
Areas of specialization
Simulations
Statistical analysis
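Trial-and-error learning from rewards can be illustrated with a small simulation. The sketch below assumes a hypothetical environment with three actions that each pay a reward of 1 with a fixed probability; the function name run_bandit, the probabilities, and the epsilon-greedy strategy are illustrative choices, not something the notes prescribe.

```python
import random

def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: mostly exploit the best-known action, sometimes explore."""
    rng = random.Random(seed)
    n_actions = len(reward_probs)
    estimates = [0.0] * n_actions  # agent's running estimate of each action's value
    counts = [0] * n_actions       # how often each action has been tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)  # explore: try a random action
        else:
            action = max(range(n_actions), key=lambda a: estimates[a])  # exploit
        # The environment answers with a reward; no "correct answer" is ever shown.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        counts[action] += 1
        # Incremental average: nudge the estimate toward the observed reward.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts

# Hypothetical reward probabilities for three actions.
estimates, counts = run_bandit([0.2, 0.5, 0.8])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print("Estimated values:", estimates)
print("Action the agent prefers:", best_action)
```

The agent is never told which action is best; it infers that only from the rewards it observes.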
Semi-supervised learning
Combines the benefits of supervised and unsupervised learning.
Used when obtaining a fully labeled dataset is time-consuming and/or expensive.
Machine learning using Python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Load iris dataset
iris = load_iris()
# Split dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(iris['data'], iris['target'], random_state=0)
# Initialize your classifier
knn = KNeighborsClassifier(n_neighbors=1)
# Fit the model
knn.fit(X_train, y_train)
# Make a prediction
prediction = knn.predict([[5, 2.9, 1, 0.2]])
print("Prediction: ", prediction)
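A natural follow-up is to check how well the fitted model predicts the held-out test split, since supervised learning is judged on new, unseen data. The sketch below repeats the setup so it runs on its own; the variable names accuracy and predicted_name are introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same setup as the example above.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris["data"], iris["target"], random_state=0
)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

# Fraction of test samples whose predicted label matches the true label.
accuracy = knn.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")

# Translate the numeric class index back into a species name.
predicted_index = knn.predict([[5, 2.9, 1, 0.2]])[0]
predicted_name = iris["target_names"][predicted_index]
print("Predicted species:", predicted_name)
```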
Perception
The ability to capture, process, and actively make sense of information that’s being received by our senses.
Cognitive process that makes you interpret your surroundings.
Data Visualization
Abstract: describes information that is not physical.
Graphical display of abstract information.
Making sense/sense-making = data analysis
Communication
Powerful tool to discover, analyze, understand, and present your stories.
The goal is to translate abstract information into visual representations that can be easily, efficiently, accurately, and meaningfully decoded.
Statistical information is abstract
Can display relationships between non-quantitative values with nodes.
Cognition
In an era where technology is rapidly reshaping the way we interact with the world, understanding the intricacies of AI is not just a skill, but a necessity for designers.
Pictures for the eyes and mind
Data visualization is only successful to the degree that it encodes information in a manner that our eyes can discern and our brains can understand.
Consider a case when you need to help people understand the primary causes of death in America.
To achieve this goal, the display should achieve the following:
Clearly indicates how the values relate to one another, which in this case is a part-to-whole relationship - the number of deaths per cause, when summed, equals all deaths during the year.
Represents the quantities accurately.
Makes it easy to compare the quantities.
Makes it easy to see the ranked order of values, such as from the leading cause of death to the least.
Makes obvious how people should use the information - what they should use it to accomplish - and encourages them to do this.
The traditional way to display this information graphically involves a pie chart
Gestalt principles of perception
Proximity
Similarity
Enclosure
Closure
Continuity
Connection
Information Visualization
Information: Processed data
Visualization: Images or graphics to communicate information
Also known as graphics visualization: any technique for creating images, diagrams, or animations to communicate a message. Visual imagery has been an effective way to communicate both abstract and concrete ideas since the dawn of humanity.
Input: Data
Data: Unprocessed information
Output: Information
Information visualization
The practice of representing data in a meaningful, visual way that users can interpret and easily comprehend. This includes data visualizations and dashboards. Information visualization is an effective way to share insights in a digestible format for non-experts.
Advantages of Information/Data Visualization
Eyes are drawn to colors and patterns.
Our culture is visual, including everything from art and advertisements to TV and movies.
A chart allows us to quickly see trends/patterns and outliers.
If we can see something, we internalize it quickly.
Storytelling with a purpose.
Helps keep interest in the subject.
More effective than a simple spreadsheet.
The easy sharing of information.
Interactively explore opportunities.
Helps with decision-making. (Data-driven decisions)
Visualize patterns and relationships.
Disadvantages of Information/Data Visualization
Sometimes data can be misrepresented or misinterpreted when placed in the wrong style of data visualization.
When viewing a visualization with many different data points, it’s easy to make an inaccurate assumption.
Visualizations can be designed wrong, making them biased and confusing.
Biased or inaccurate information
Correlation doesn’t always mean causation.
Core messages can get lost in translation.
Elements of visualizations
Images
Spreadsheets
Animations
Videos
Maps
Geospatial
Proportional symbol maps
Choropleth, Isopleth, Area Maps
Heatmaps
Treemaps
Dashboards
Tables
Diagrams
Graphs
Bar Graphs
Bullet Graphs
Box-and-whisker Plot
Infographic
Charts
Pie Charts
Bar Charts
Gantt Charts
Histograms
Geospatial
A visualization that shows data in map form using different shapes and colors to show the relationship between pieces of data and specific locations.
Focus on the relationship between data and its physical location to create insight.
Geovisualization overlays variables on a map using latitude and longitude to foster insight.
Maps are the primary focus. They act as a container for extra data. This allows for the creation of context using shapes and color to change the visual focus. They identify problems, track change, understand trends, and perform forecasting related to specific places and times.
Heat maps
A type of geospatial visualization in map form that displays specific data values as different colors (this doesn’t need to be temperatures, but that is a common use).
Treemaps
A type of chart that shows different, related values in the form of rectangles nested together.
Bullet graphs
A bar marked against a background to show progress or performance against a goal, denoted by a line on the graph.
Box-and-whisker Plot
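These summarize the distribution of a set of values: the box spans the interquartile range (first to third quartile) with a line at the median, and the whiskers extend from the box to show the spread of the remaining data.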
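The numbers behind a box-and-whisker plot can be computed directly. This is a rough sketch on made-up values; the 1.5 x IQR rule for whisker limits is one common convention and is an assumption here, since plotting tools may use other rules.

```python
import numpy as np

# Hypothetical measurements with one unusually large value.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 30])

# The box: first quartile, median, third quartile.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # interquartile range, i.e. the height of the box

# Assumed convention: whiskers reach the most extreme points within 1.5 * IQR of the box.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
whisker_low = values[values >= lower_limit].min()
whisker_high = values[values <= upper_limit].max()

# Anything beyond the whiskers is drawn as an individual outlier point.
outliers = values[(values < lower_limit) | (values > upper_limit)]

print("Box:", q1, median, q3)
print("Whiskers:", whisker_low, whisker_high)
print("Outliers:", outliers)
```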
Gantt charts
Typically used in project management, a Gantt chart is a bar-chart depiction of timelines and tasks.
Ten important factors of information visualization
Information becomes easily shareable
Decision making
Identify trends and patterns
Optimize resources
Resource allocation
Easier for stakeholders to understand/internalize information
Customer satisfaction
Enhanced efficiency
Cost reduction
Innovation and competitiveness
Social implications
Role of Python in data analysis and information visualization
Built with a focus on business information analysis
User-friendly syntax
Ecosystem libraries
Community support
Scalability
Integrability and interpretability
Pandas library
Helps with data analysis and manipulation
Data can be transformed
Time Series Analysis
Definition: A specific way of analyzing a sequence of data points collected over an interval of time.
Analysts record data points at consistent intervals over a set period of time rather than just recording the data points intermittently or randomly.
Time is a crucial variable because it shows how the data adjusts over the course of the data points as well as the final results. It provides an additional source of information and a set order of dependencies between the data.
Typically requires a large amount of data.
Ensures that trends or patterns discovered are not outliers.
Why organizations use time series data analysis
Time series analysis helps organizations understand the underlying causes of trends or systemic patterns over time and predict future events.
Examples of use cases for time series analysis
Weather data
Rainfall measurements
Temperature readings
Heart rate monitoring (EKG)
Brain monitoring (EEG)
Quarterly sales
Stock prices
Automated stock trading
Industry forecasts
Interest rates
Time Series Analysis considerations
Variability
Rate of Change
Measured in percentage between the two points.
Covariance
Cycles
Linear fashion of viewing data within a period of time.
Exceptions
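The rate of change between two points, expressed as a percentage, can be worked out as follows. The helper name rate_of_change is hypothetical, and the prices reuse the small sample series from the Python example in these notes.

```python
def rate_of_change(old_value, new_value):
    """Percentage change from old_value to new_value."""
    return (new_value - old_value) / old_value * 100

# Sample values; each entry is one time step.
prices = [1, 2, 3, 4, 3, 4, 5, 6, 7, 8]

# Compare every point with the one before it.
changes = [rate_of_change(old, new) for old, new in zip(prices, prices[1:])]

for step, change in enumerate(changes, start=1):
    print(f"Step {step}: {change:+.1f}%")
```

A move from 1 to 2 and a move from 4 to 8 are both a 100% change, which is why percentages are useful when comparing series of very different magnitudes.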
Time Series Analysis models
Classification
Curve fitting
Descriptive analysis
Explanative analysis
Exploratory analysis
Forecasting
Intervention analysis
Segmentation
Time Series model displays
Line graph (Works best for time series analysis)
Analyzing patterns and exceptions
Bar plots
Compares individual values
Dot plots / Box plots
Analyze distribution changes
Radar graph
Comparing cycles
Heatmap
Analyze high-volume cyclical patterns and exceptions
Uses color to encode quantitative values
Time Series techniques and best practices
Aggregations to various time intervals
Examples:
Quarterly
Monthly
Weekly
Daily
Viewing time periods in context
Grouping related time intervals
Using running averages to enhance the perception of high-level patterns
Omitting missing values from the display
Optimize a graph’s aspect ratio
Using the logarithmic scale to compare the rate of change
Overlapping time scales to compare cyclical patterns
Using cycle plots to examine trends and cycles together
Combining individual and cumulative values to compare actuals to targets
Stacking line graphs to compare multiple variables
Expressing time as 0-100% to compare asynchronous processes
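Two of the techniques above, aggregating to a coarser time interval and smoothing with a running average, can be sketched with pandas. The data are the same made-up daily values used in the plotting example in these notes; the window size of 3 is an arbitrary choice for illustration.

```python
import pandas as pd

# Ten daily observations starting on 1 January 2022 (hypothetical values).
prices = pd.Series(
    [1, 2, 3, 4, 3, 4, 5, 6, 7, 8],
    index=pd.date_range(start="2022-01-01", periods=10, freq="D"),
    name="Stock_Price",
)

# Aggregation: collapse daily values into weekly means (weeks ending on Sunday).
weekly_mean = prices.resample("W").mean()

# Running average: mean of each value and the two before it.
# The first two entries are NaN because a full window is not yet available.
running_avg = prices.rolling(window=3).mean()

print(weekly_mean)
print(running_avg)
```

A wider window smooths more aggressively, which makes high-level patterns easier to see at the cost of hiding short-lived changes.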
Time Series Analysis Python example
import pandas as pd
import matplotlib.pyplot as plt
# Simple time-series plot
time_series_data = pd.DataFrame({
'Date': pd.date_range(start='1/1/2022', periods=10, freq='D'),
'Stock_Price': [1, 2, 3, 4, 3, 4, 5, 6, 7, 8]
})
time_series_data.plot(x='Date', y='Stock_Price', kind='line')
plt.title('Time-Series Data')
plt.show()
When designing interaction with any type of navigation menu, we have to consider the following six aspects:
Symbols
Target areas
Interaction event
Layout
Levels
Functional context
Symbols
Users often rely on small visual clues, such as icons and symbols, to guide them through a website’s interface. Creating a system of symbolic communication throughout the website that is unambiguous and consistent is important.
The first principle in designing a drop-down navigation menu is to make users aware that it exists in the first place.
Triangle symbol
Plus symbol
Three-line symbol
Consistent use of symbols
Target areas
A simple yet important rule is that links in a navigation menu should be easy to read, large and consistently located. The area in the interface that is assigned to and activates a link is typically referred to as the target area.
Legibility
Size
Consistency of location
Interaction event
Four most common events:
Hovering
Clicking
Scrolling
Typing
Levels
Designing a single-level navigation menu is hard enough as it is. Incorporating multiple levels complicates the matter, especially on small screens.
Removing navigation levels
Levels and mobiles
Levels and mega-menus
Dynamic filters
Breadcrumbs
Mega-sites
Arrays
Collection of data/values of the same data type.
Example:
Array of student scores
int array_score[10] = {75, 88, 69, 90, 66, 81, 98, 77, 85, 70};
Matrices
A two-dimensional data structure where numbers are arranged into rows and columns.
Example:
1 2 3 4
5 6 7 8
9 10 11 12
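For comparison, the same array of scores and the same 3 x 4 matrix might be written in Python with NumPy as below. The variable names scores and matrix are chosen for this sketch.

```python
import numpy as np

# One-dimensional array: ten student scores, all of the same data type.
scores = np.array([75, 88, 69, 90, 66, 81, 98, 77, 85, 70])

# Two-dimensional array (matrix): 3 rows and 4 columns.
matrix = np.array([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
])

print(scores.shape)   # number of elements along each dimension
print(matrix.shape)   # (rows, columns)
print(matrix[1, 2])   # element in the second row, third column (indices start at 0)
print(matrix[:, 0])   # the whole first column
```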
NumPy (Python library)
Library used for numerical computing
Complex analysis of arrays
Multi-dimensional arrays
Data manipulation
Capable of handling large datasets with ease.
Mathematical functions to operate on data structures
Basic statistical operations/functions
Mean
Standard deviation
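Skewness (not built into NumPy; available as scipy.stats.skew in SciPy)
Kurtosis (not built into NumPy; available as scipy.stats.kurtosis in SciPy)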
Broadcasting
Automatically expands smaller arrays to match the shape of larger ones.
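A short sketch of basic statistics and broadcasting with NumPy follows. The arrays are illustrative, and centering each column by its mean is just one example of an operation where broadcasting does the shape matching.

```python
import numpy as np

scores = np.array([75, 88, 69, 90, 66, 81, 98, 77, 85, 70])
mean_score = scores.mean()
std_score = scores.std()  # population standard deviation (ddof=0 by default)
print(mean_score, std_score)

matrix = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
])

# Mean of each column: a 1-D array with 4 elements.
column_means = matrix.mean(axis=0)

# Broadcasting: the (4,) array is stretched across all 3 rows of the (3, 4)
# matrix, so each column has its own mean subtracted without an explicit loop.
centered = matrix - column_means

# A scalar is broadcast to every element in the same way.
scaled = matrix * 10

print(centered)
```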
Pandas (Python library)
Data manipulation capability
Data format and data types
Cleaning datasets
Transforms data into visualizations
DataFrame
A highly versatile data structure that is essentially a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).
df.head()
Returns first 5 rows of the DataFrame
df.tail()
Returns last 5 rows of the DataFrame
df.info()
Concise summary of the DataFrame
df.describe()
Statistical insight into the numerical columns of the DataFrame
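The four DataFrame methods above can be tried on a small table. The column names and values here are invented for the example.

```python
import pandas as pd

# Hypothetical table: ten students and their scores.
df = pd.DataFrame({
    "student": [f"S{i}" for i in range(1, 11)],
    "score": [75, 88, 69, 90, 66, 81, 98, 77, 85, 70],
})

first_rows = df.head()   # first 5 rows by default
last_rows = df.tail()    # last 5 rows by default
df.info()                # prints column names, non-null counts, and dtypes
summary = df.describe()  # count, mean, std, min, quartiles, max for numeric columns

print(first_rows)
print(last_rows)
print(summary)
```

Both head() and tail() accept a number, for example df.head(3), when a different row count is wanted.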
Series
A type of data structure in the Pandas library. It is a one-dimensional labeled array that contains data of any type. It can be thought of as a single column in a DataFrame.
This means that the Series can be used to store a single column of data, such as a list of numbers, names, or any other data type. In pandas, you can create a Series from a list, array, or dictionary.
s.size
Returns the number of elements in the series (an attribute, so no parentheses)
s.mean()
Returns the mean (average) value of the series
s.std()
Returns the standard deviation of the series
s.unique()
Returns an array of unique values in the series
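A minimal sketch of creating a Series and using these members is shown below. The values and the names s and from_dict are made up for the example.

```python
import pandas as pd

# A Series built from a list; pandas assigns the labels 0, 1, 2, 3.
s = pd.Series([2, 4, 4, 6], name="values")

print(s.size)      # number of elements (attribute, not a method call)
print(s.mean())    # arithmetic mean
print(s.std())     # sample standard deviation (pandas divides by n - 1 by default)
print(s.unique())  # distinct values in order of first appearance

# A Series built from a dictionary; the keys become the labels.
from_dict = pd.Series({"a": 1, "b": 2})
print(from_dict["b"])
```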
Matplotlib (Python library)
One of the most widely used and versatile Python libraries available for creating static, interactive, and animated visualizations. With Matplotlib, users can easily create a wide range of visualizations, including line plots, scatter plots, bar plots, histograms, and more. Additionally, Matplotlib provides a high degree of customization, allowing users to tailor their visualizations to their precise needs.
Additional features
Subplots
Legends
Annotations
Error bars
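The four features listed above can be combined in one small figure. This is a sketch with invented data; the "Agg" backend line is an assumption that lets the code run without a display and can be removed when working interactively.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt

# Hypothetical measurements with an uncertainty for each point.
days = [1, 2, 3, 4, 5]
values = [2.0, 3.5, 3.0, 5.0, 4.5]
errors = [0.3, 0.4, 0.2, 0.5, 0.3]

# Subplots: one figure holding two side-by-side axes.
fig, (ax_line, ax_err) = plt.subplots(1, 2, figsize=(8, 3))

# Left panel: line plot with a legend and an annotation pointing at the peak.
ax_line.plot(days, values, label="Measured value")
ax_line.annotate("peak", xy=(4, 5.0), xytext=(2, 4.8),
                 arrowprops={"arrowstyle": "->"})
ax_line.legend()
ax_line.set_title("Legend and annotation")

# Right panel: the same points drawn with error bars.
ax_err.errorbar(days, values, yerr=errors, fmt="o", capsize=3, label="Value +/- error")
ax_err.legend()
ax_err.set_title("Error bars")

fig.tight_layout()
```

Calling plt.show() (with an interactive backend) or fig.savefig with a file name of your choice would then display or store the figure.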
Seaborn (Python library)
A powerful library built on top of Matplotlib that offers a high-level, user-friendly interface. It integrates closely with Pandas data structures and incorporates best practices for effective data visualization. With Seaborn, you'll have access to a wider range of color palettes, more visually appealing plots, and simpler syntax.
Create more complex visualizations
More customization options
Tweaking color schemes
Adjusting axis limits
Adding annotations
Offers a variety of statistical plots
Bar plots
Pair plots
Heat maps
Violin plots
Facet grids
Joint plots