
Pandas DataFrames and Series Explained

Pandas DataFrames

Introduction to Pandas DataFrames

  • Pandas DataFrames are two-dimensional data structures in the pandas package for Python.

  • They allow you to work with heterogeneous data types.

  • They are built on pandas Series: each column of a DataFrame is a Series.

Pandas Series

  • One-dimensional data objects in pandas.

  • Used to build DataFrames.

Importing Libraries
  • Import NumPy: import numpy as np

  • Import pandas: import pandas as pd

Creating a Series
  • Similar to NumPy arrays but with the ability to define custom index labels.

  • Example:
    my_series = pd.Series(data=[1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

Creating Series using Dictionaries
  • Dictionaries can be used to create series where keys become index values and values become data values.

    my_dict = {'a': 2, 'b': 4, 'c': 6, 'd': 8}
    my_series = pd.Series(my_dict)
    
Accessing Values
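  • Values can be accessed by label or by integer position. Use .iloc for positions; plain integer keys such as my_series[0] are deprecated on a label-indexed series.

    my_series['a']      # Access by label
    my_series.iloc[0]   # Access by integer position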
    
Slicing Series
  • Slicing returns a new series containing both values and labels. A label-based slice includes its end label.

    my_series['a':'c']
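
  • A minimal sketch contrasting label-based and position-based slices, reusing the series from the earlier example. The names by_label and by_position are illustrative only.

```python
import pandas as pd

my_series = pd.Series(data=[1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# Label-based slice: both endpoints ('a' and 'c') are included.
by_label = my_series['a':'c']

# Position-based slice: the end position (2) is excluded.
by_position = my_series.iloc[0:2]

print(by_label)
print(by_position)
```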
    
Mathematical Operations
  • Operations are performed element-wise based on labels.

    my_series + my_series # Adds elements with matching labels
    
  • Mismatched labels result in NaN (Not a Number) values.
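
  • A small sketch of label alignment with two series whose labels only partly overlap. The series left and right are made up for illustration.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Addition aligns on labels: 'b' and 'c' exist in both operands,
# while 'a' and 'd' exist in only one and therefore become NaN.
total = left + right
print(total)
```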

NumPy Functions
  • Many NumPy functions work on series.

    np.mean(my_series)
    

Pandas DataFrames

  • Two-dimensional tables with labeled columns that can hold different types of data.

  • Python implementation of tables like those in Excel or SQL databases.

  • The standard data structure for tabular data in Python.

Creating DataFrames
  • DataFrames are created with pd.DataFrame() from dictionaries, two-dimensional NumPy arrays, or series.

  • When a dictionary is used, keys become column names, and values populate the columns.

    data = {
        'name': ['Joe', 'Bob', 'Franz'],
        'age': np.array([20, 21, 19]),
        'weight': (150, 160, 145),
        'height': pd.Series([5.8, 5.9, 6.0], index=['Joe', 'Bob', 'Franz']),
        'siblings': 1,
        'gender': 'm'
    }
    df = pd.DataFrame(data)
    
Column Creation
  • Different sequence data structures (lists, NumPy arrays, series, tuples) can be used to populate columns.

  • Single values will fill the entire column.

Row Index
  • If a pandas series with an index is used, that index will be used as the row index for the DataFrame.

  • Otherwise, numerical indices are used by default.
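
  • The two cases above can be sketched as follows. The frames plain and labeled are small illustrative examples, not part of the lesson data.

```python
import pandas as pd

# No series involved: pandas falls back to a default integer index.
plain = pd.DataFrame({'name': ['Joe', 'Bob'], 'age': [20, 21]})

# One column is a labeled series: its labels become the row index.
height = pd.Series([5.8, 5.9], index=['Joe', 'Bob'])
labeled = pd.DataFrame({'age': [20, 21], 'height': height})

print(plain.index)
print(labeled.index)
```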

Custom Row Labels
  • Custom row labels can be provided during DataFrame construction using the index parameter.

    df = pd.DataFrame(data, index=['Joe', 'Bob', 'Franz'])
    
Accessing and Modifying DataFrames
  • DataFrames behave like dictionaries of Pandas Series objects.

Accessing Columns
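  • Columns can be accessed using dictionary-like indexing or dot notation. Dot notation only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute or method.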

    df['weight']
    df.weight
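
  • A short sketch of where dot notation stops working. The column names 'shoe size' and 'count' are hypothetical, chosen only to trigger the two failure cases.

```python
import pandas as pd

df = pd.DataFrame({
    'weight': [150, 160, 145],
    'shoe size': [10, 11, 9],   # contains a space
    'count': [1, 2, 3],         # same name as the DataFrame.count method
})

# For a simple column name, both forms return the same series.
same = df['weight'].equals(df.weight)

# A name with a space can only be reached with bracket indexing.
shoe = df['shoe size']

# df.count resolves to the method, not the column.
is_method = callable(df.count)
count_col = df['count']
```

    Bracket indexing works for every column name, so it is the safer default in scripts.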
    
Deleting Columns
  • Columns can be deleted using the del keyword.

    del df['weight']
    
Adding Columns
  • New columns can be added like adding new objects to a dictionary.

    df['IQ'] = [120, 130, 140]
    
  • If a series is inserted, it will be matched based on indices; unmatched indices will be NaN.

  • A list or array assigned as a new column must have the same length as the DataFrame; otherwise pandas raises a ValueError. Only a single (scalar) value is repeated to fill the entire column.
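
  • A minimal sketch of the three assignment cases: a series aligned by index, a scalar, and a list of the wrong length. The 'college' column and its value are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'age': [20, 21, 19]}, index=['Joe', 'Bob', 'Franz'])

# Series: matched on index labels; rows without a match become NaN.
df['college'] = pd.Series(['Example University'], index=['Franz'])

# Scalar: repeated for every row.
df['siblings'] = 1

# List of the wrong length: pandas raises ValueError and adds nothing.
try:
    df['IQ'] = [120, 130]
    length_error = None
except ValueError as exc:
    length_error = exc

print(df)
print(length_error)
```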

Indexing with .loc and .iloc
  • .loc is used for label-based indexing.

    df.loc['Joe']                        # Row with label 'Joe'
    df.loc['Joe', 'IQ']                  # Value at row 'Joe', column 'IQ'
    df.loc['Joe':'Bob', 'gender':'IQ']   # Label slices of rows and columns (endpoints included)
    
  • .iloc is used for integer-based indexing.

    df.iloc[0]          # Get row at index 0
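
  • A side-by-side sketch of .loc and .iloc on a small frame that mirrors the example data. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'age': [20, 21, 19], 'IQ': [120, 130, 140]},
    index=['Joe', 'Bob', 'Franz'],
)

# The same cell, addressed by label and by integer position.
by_label = df.loc['Bob', 'IQ']
by_position = df.iloc[1, 1]

# Label slices include the end label; position slices exclude the end.
loc_rows = df.loc['Joe':'Bob']   # rows 'Joe' and 'Bob'
iloc_rows = df.iloc[0:1]         # row 'Joe' only
```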
    
Logical Indexing
  • Rows can be selected using a sequence of boolean values (logical index).

    bool_index = [False, True, True]
    df[bool_index]
    
  • Logical indexing is useful for subsetting data based on comparisons.

    df[df['age'] > 12]
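
  • A sketch of what the comparison produces before it is used as an index. The threshold 19 is arbitrary, picked so that one row is filtered out.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Joe', 'Bob', 'Franz'], 'age': [20, 21, 19]})

# The comparison yields a boolean series with one entry per row.
mask = df['age'] > 19

# Indexing with the mask keeps only the rows where it is True.
older = df[mask]

print(mask)
print(older)
```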
    

Exploring DataFrames

  • Exploration tools are useful for getting an overview of data loaded from external sources.

Loading Data
  • Example using the Titanic dataset:

    titanic_train = pd.read_csv('titanic_train.csv')
    type(titanic_train)  # pandas.DataFrame
    
  • Loading data will be covered in the next lesson.

DataFrame Size
  • The .shape attribute returns the dimensions of the DataFrame as a (rows, columns) tuple.

    titanic_train.shape  # (891, 12): 891 rows, 12 columns
    
Viewing Rows
  • .head(n) shows the first n rows (5 if n is omitted).

  • .tail(n) shows the last n rows (5 if n is omitted).
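
  • A quick sketch of .head() and .tail() on a made-up ten-row frame, since the Titanic file is not assumed to be available here.

```python
import pandas as pd

df = pd.DataFrame({'value': range(10)})

first_three = df.head(3)    # rows 0, 1, 2
last_two = df.tail(2)       # rows 8, 9
default_head = df.head()    # n defaults to 5

print(df.shape)
print(first_three)
print(last_two)
```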

Index Column
  • Check the DataFrame index: df.index

  • A column can be converted to the index, after which the original column is removed.

    titanic_train.index = titanic_train['Name']
    del titanic_train['Name']

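
  • The same result can be reached in one step with the set_index() method, sketched here on a tiny made-up frame. The names and ages are hypothetical, not Titanic records.

```python
import pandas as pd

passengers = pd.DataFrame({'Name': ['Ann', 'Ben'], 'Age': [30, 41]})

# set_index moves the column into the index and drops it from the columns.
# It returns a new DataFrame; the original is left unchanged.
indexed = passengers.set_index('Name')

print(indexed)
```
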
Getting Column Names
  • .columns attribute shows the column names.

    titanic_train.columns
    
Summary Statistics
  • The .describe() method shows summary statistics for numeric columns.

    titanic_train.describe()
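
  • A sketch of the shape of the .describe() output on a small mixed-type frame. The data is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Joe', 'Bob', 'Franz'], 'age': [20, 21, 19]})

# By default only numeric columns are summarized, so 'name' is skipped.
summary = df.describe()

print(summary)
```

    Each row of the result is one statistic (count, mean, std, min, the quartiles, max), and each column is one numeric column of the original frame.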
    
NumPy Functions on DataFrames
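  • NumPy functions can be applied column by column using axis=0. Non-numeric columns should be excluded first, because pandas 2.0 and later raise a TypeError when asked for the mean of text columns.

    np.mean(titanic_train.select_dtypes(include='number'), axis=0)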
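
  • A self-contained sketch of a column-wise NumPy mean on a small mixed-type frame. Selecting the numeric columns first is an assumption about how to handle the text column, not something the lesson prescribes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['Joe', 'Bob', 'Franz'],
    'age': [20, 21, 19],
    'weight': [150, 160, 145],
})

# Keep only numeric columns so the mean is defined for every column.
numeric = df.select_dtypes(include='number')

# axis=0 collapses the rows, giving one mean per column.
col_means = np.mean(numeric, axis=0)

print(col_means)
```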
    
DataFrame Information
  • The .info() method shows a summary of the DataFrame structure.

    titanic_train.info()