Pandas DataFrames are two-dimensional data structures in the pandas package for Python.
They allow you to work with heterogeneous data types.
DataFrames are built upon pandas Series.
Series are one-dimensional data objects in pandas.
They are used to build DataFrames.
Import NumPy: import numpy as np
Import pandas: import pandas as pd
Series are similar to NumPy arrays but let you define custom index labels.
Example: my_series = pd.Series(data=[1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
Dictionaries can be used to create series where keys become index values and values become data values.
my_dict = {'a': 2, 'b': 4, 'c': 6, 'd': 8}
my_series = pd.Series(my_dict)
Values can be accessed using associated labels or numerical indices.
my_series['a'] # Access by label
my_series.iloc[0] # Access by integer position (plain my_series[0] is deprecated when the index holds labels)
Slicing returns both values and labels.
my_series['a':'c']
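The difference between label slicing and positional slicing is easy to miss, so here is a minimal sketch using the same series as above. The variable names by_label and by_position are only illustrative.

```python
import pandas as pd

# Same series as in the notes above.
my_series = pd.Series(data=[1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# Label slice: the end label 'c' is included in the result.
by_label = my_series['a':'c']

# Positional slice: the end position 2 is excluded, as with Python lists.
by_position = my_series.iloc[0:2]

print(by_label.tolist())      # [1, 2, 3]
print(list(by_label.index))   # ['a', 'b', 'c']
print(by_position.tolist())   # [1, 2]
```

Both slices return a new series that keeps the labels of the selected values.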
Operations are performed element-wise based on labels.
my_series + my_series # Adds elements with matching labels
Mismatched labels result in NaN (Not a Number) values.
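A short sketch of label alignment with two series whose indices only partly overlap. The series s1 and s2 are made up for illustration; the add method with fill_value is one way to avoid NaN when a label is missing on one side.

```python
import pandas as pd

# Two illustrative series that share only the labels 'b' and 'c'.
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Addition aligns on labels; 'a' and 'd' exist on one side only, so they become NaN.
total = s1 + s2
print(total.isna().tolist())  # [True, False, False, True]
print(total['b'])             # 12.0

# Treat a missing label as 0 instead of producing NaN.
filled = s1.add(s2, fill_value=0)
print(filled.tolist())        # [1.0, 12.0, 23.0, 30.0]
```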
Many NumPy functions work on series.
np.mean(my_series)
DataFrames are two-dimensional tables with labeled columns that can hold different types of data.
They are the Python counterpart of tables in Excel or SQL databases.
They are the standard data structure for tabular data in Python.
They are created from dictionaries, two-dimensional NumPy arrays, or series using pd.DataFrame().
When a dictionary is used, keys become column names, and values populate the columns.
data = {
    'name': ['Joe', 'Bob', 'Franz'],
    'age': np.array([20, 21, 19]),
    'weight': (150, 160, 145),
    'height': pd.Series([5.8, 5.9, 6.0], index=['Joe', 'Bob', 'Franz']),
    'siblings': 1,
    'gender': 'm'
}
df = pd.DataFrame(data)
Different sequence data structures (lists, NumPy arrays, series, tuples) can be used to populate columns.
Single values will fill the entire column.
If a pandas series with an index is used, that index will be used as the row index for the DataFrame.
Otherwise, numerical indices are used by default.
Custom row labels can be provided during DataFrame construction using the index parameter.
df = pd.DataFrame(data, index=['Joe', 'Bob', 'Franz'])
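The notes mention two-dimensional NumPy arrays as a source for DataFrames but only show the dictionary route. The following is a minimal sketch of the array route; the array contents and the name df_from_array are illustrative, reusing the ages and weights from the dictionary example.

```python
import numpy as np
import pandas as pd

# Each inner list is one row: [age, weight].
values = np.array([[20, 150],
                   [21, 160],
                   [19, 145]])

# Column names and row labels are supplied separately,
# because a plain array carries no labels of its own.
df_from_array = pd.DataFrame(values,
                             columns=['age', 'weight'],
                             index=['Joe', 'Bob', 'Franz'])

print(df_from_array.shape)                 # (3, 2)
print(df_from_array.loc['Bob', 'weight'])  # 160
```

Because a NumPy array has a single dtype, this route suits homogeneous data; mixed types are easier to build from a dictionary.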
DataFrames behave like dictionaries of Pandas Series objects.
Columns can be accessed using dictionary-like indexing or dot notation.
df['weight']
df.weight
Columns can be deleted using the del keyword.
del df['weight']
New columns can be added like adding new objects to a dictionary.
df['IQ'] = [120, 130, 140]
If a series is inserted, it will be matched based on indices; unmatched indices will be NaN.
When a column is added from a list or array, its length must match the number of rows in the DataFrame; otherwise pandas raises a ValueError. A single scalar value is the exception: it is broadcast to fill the entire column.
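The column-addition cases can be sketched on a small frame. The frame below is a reduced, illustrative version of the one in these notes, and the column name 'bad' is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'age': [20, 21, 19]}, index=['Joe', 'Bob', 'Franz'])

# A list must have exactly one entry per row.
df['IQ'] = [120, 130, 140]

# A scalar is broadcast to every row.
df['siblings'] = 1

# A series is aligned on its index; 'Bob' has no entry, so his height is NaN.
df['height'] = pd.Series({'Joe': 5.8, 'Franz': 6.0})

# A list of the wrong length is rejected rather than silently filled.
length_error = None
try:
    df['bad'] = [1, 2]
except ValueError as err:
    length_error = str(err)

print(df['siblings'].tolist())          # [1, 1, 1]
print(pd.isna(df.loc['Bob', 'height'])) # True
print(length_error is not None)         # True
```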
.loc is used for label-based indexing.
df.loc['Joe'] # Get row with label 'Joe'
df.loc['Joe', 'IQ'] # Get value at row 'Joe', column 'IQ'
df.loc['Joe':'Bob', 'age':'IQ'] # Slice rows and columns by label (both endpoints are included)
.iloc is used for integer-based indexing.
df.iloc[0] # Get row at index 0
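To compare the two indexers side by side, here is a small sketch on an illustrative two-column frame (not the full frame from these notes).

```python
import pandas as pd

df = pd.DataFrame(
    {'age': [20, 21, 19], 'IQ': [120, 130, 140]},
    index=['Joe', 'Bob', 'Franz'],
)

# The same row, selected by label and by position.
row_by_label = df.loc['Bob']
row_by_position = df.iloc[1]

# Single cell by position: row 0, column 1.
cell = df.iloc[0, 1]

# Positional slices exclude the end position; label slices include the end label.
block = df.iloc[0:2, 0:1]
label_block = df.loc['Joe':'Bob', 'age':'IQ']

print(cell)               # 120
print(block.shape)        # (2, 1)
print(label_block.shape)  # (2, 2)
```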
Rows can be selected using a sequence of boolean values (logical index).
bool_index = [False, True, True]
df[bool_index]
Logical indexing is useful for subsetting data based on comparisons.
df[df['age'] > 12]
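A comparison on a column produces a boolean series that can be stored, inspected, and combined with other conditions. The sketch below uses an illustrative frame and thresholds; note that conditions are combined with & and | and each needs its own parentheses.

```python
import pandas as pd

df = pd.DataFrame(
    {'age': [20, 21, 19], 'weight': [150, 160, 145]},
    index=['Joe', 'Bob', 'Franz'],
)

# The comparison yields one True/False value per row.
mask = df['age'] > 19
print(mask.tolist())        # [True, True, False]

# Rows where the mask is True are kept.
older = df[mask]
print(list(older.index))    # ['Joe', 'Bob']

# Combine conditions element-wise with & (and) or | (or).
both = df[(df['age'] > 19) & (df['weight'] < 155)]
print(list(both.index))     # ['Joe']
```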
Inspecting a DataFrame is especially useful after loading data from external sources.
Example using the Titanic dataset:
titanic_train = pd.read_csv('titanic_train.csv')
type(titanic_train) # pandas.DataFrame
Loading data will be covered in the next lesson.
.shape attribute shows the dimensions of the DataFrame as a (rows, columns) tuple.
titanic_train.shape # (891, 12): 891 rows, 12 columns
.head(n) shows the first n rows.
.tail(n) shows the last n rows.
.index attribute shows the row index of a DataFrame (df.index). It can also be reassigned, for example to the values of a column:
titanic_train.index = titanic_train['Name']
del titanic_train['Name']
.columns attribute shows the column names.
titanic_train.columns
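Since titanic_train.csv is not available here, the inspection tools can be tried on a small stand-in frame. Everything in the sketch below (the frame passengers and its values) is made up for illustration and is not Titanic data. It also uses set_index, which returns a new frame with the chosen column moved into the index, as an alternative to assigning the index and deleting the column in two steps.

```python
import pandas as pd

# Hypothetical stand-in for a loaded dataset; the values are invented.
passengers = pd.DataFrame({
    'Name': ['Ann', 'Ben', 'Cara', 'Dev'],
    'Age': [29, 41, 35, 8],
    'Fare': [7.25, 71.28, 8.05, 21.0],
})

print(passengers.shape)            # (4, 3)

first_two = passengers.head(2)     # first 2 rows
last_one = passengers.tail(1)      # last row

# Move 'Name' into the row index; the original frame is left unchanged.
indexed = passengers.set_index('Name')
print(list(indexed.columns))       # ['Age', 'Fare']
print(list(indexed.index))         # ['Ann', 'Ben', 'Cara', 'Dev']
```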
.describe() method shows summary statistics for numeric columns.
titanic_train.describe()
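NumPy functions can operate on DataFrame columns using axis=0. Non-numeric columns such as text cannot be averaged, so select the numeric columns first.
np.mean(titanic_train.select_dtypes(include='number'), axis=0)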
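A minimal sketch of column-wise means on a frame that mixes text and numbers. The frame mixed and its values are invented for illustration; the pandas method mean(numeric_only=True) is shown as an equivalent alternative to filtering the columns by hand.

```python
import numpy as np
import pandas as pd

# Illustrative frame with one text column and two numeric columns.
mixed = pd.DataFrame({
    'Name': ['Ann', 'Ben', 'Cara'],
    'Age': [20, 30, 40],
    'Fare': [10.0, 20.0, 30.0],
})

# Keep only the numeric columns, then average down each column (axis=0).
numeric = mixed.select_dtypes(include='number')
col_means = np.mean(numeric, axis=0)

# The pandas method can skip non-numeric columns itself.
same = mixed.mean(numeric_only=True)

print(col_means['Age'])   # 30.0
print(col_means['Fare'])  # 20.0
```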
.info() method shows a summary of the DataFrame structure: column names, non-null counts, and data types.
titanic_train.info()