Pandas Basics: DataFrames, Indexing, and Data Wrangling

Pandas Basics: Working with Tabular Data

  • Pandas is the go-to Python package for working with tabular data. It provides powerful data structures and operations for manipulating structured data.

  • The central data structure is the DataFrame, a 2D labeled data structure with columns (fields) and rows (records/observations). An optional index identifies rows; by default it is a simple range like 0, 1, 2, … but you can provide meaningful indices (e.g., product IDs).

  • Terminology:

    • Fields/columns: the named data columns in a dataset.

    • Records/observations: the rows in a dataset.

    • Index: a label for each row; ideally unique.

  • A DataFrame is built from many sources; common example shown uses a dictionary where keys become column names and values are lists of data, e.g.:

    • keys: "product", "price", "quantity_sold", "state"

    • values: lists of data for each column.

  • Import pattern:

    • Basic import: import pandas as pd

    • Alias shortens subsequent calls (e.g., pd instead of pandas).
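
The dictionary-to-DataFrame pattern above can be sketched as follows; the product/price/quantity_sold/state values here are hypothetical placeholder data:

```python
import pandas as pd

# Hypothetical sales data; dict keys become column names,
# and each list becomes a column of values.
data = {
    "product": ["widget", "gadget", "gizmo"],
    "price": [9.99, 19.99, 4.99],
    "quantity_sold": [120, 45, 300],
    "state": ["CA", "NY", "TX"],
}

df = pd.DataFrame(data)
print(df.shape)  # (3, 4): 3 records, 4 fields
```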

Core concepts: DataFrame, Series, and index

  • A DataFrame is made up of Series; indexing or slicing a DataFrame by a single column yields a Series, not a DataFrame.

  • A Series is the 1D labeled array that underlies a DataFrame column.

  • DataFrames carry a shape: shape = (number of rows, number of columns). For example, a dataset with 244 rows and 7 columns has shape (244, 7).

  • Typical initial properties to inspect:

    • Columns: df.columns

    • Data types: df.dtypes (or df[col].dtype for a single column)

    • Non-null counts and memory: df.info()

    • Quick numeric summaries: df.describe() (numeric columns by default)

    • Column-wise counts by category: df['state'].value_counts()

  • For a practical example, a CSV dataset read into pandas might contain 244 records and 7 columns (e.g., a tips dataset). The value counts help identify the distribution of categorical columns like state.
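
The inspection calls above can be tried on a small hypothetical DataFrame (a stand-in for the tips/sales data, since the actual CSV is not reproduced here):

```python
import pandas as pd

# Hypothetical data to inspect; 'state' is a categorical-like column.
df = pd.DataFrame({
    "product": ["widget", "gadget", "gizmo", "widget"],
    "price": [9.99, 19.99, 4.99, 9.99],
    "state": ["CA", "CA", "NY", "TX"],
})

print(df.dtypes)                   # price is float64; product/state are object
print(df.describe())               # numeric summary covers 'price' only
print(df["state"].value_counts())  # counts per category; CA appears twice
```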

Quick start: creating and loading data

  • Create a simple DataFrame from a dictionary (columns become keys):

    • Example: keys become column names, values are lists of data.

    • Code style example (conceptual): pd.DataFrame(data) where data is a dict.

  • Reading data from sources:

    • CSV: pd.read_csv('filename.csv')

    • Excel: pd.read_excel('filename.xlsx')

    • JSON: pd.read_json('filename.json')

    • Parquet: pd.read_parquet('filename.parquet'), a columnar format optimized for large datasets; useful for big data; supports partitioning.

  • In environments like Google Colab, you typically upload the file into the Colab workspace (e.g., via the Files pane) and reference the filename directly: pd.read_csv('tips.csv'). A sample dataset might be described as having, for instance, 244 records and 7 columns.
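
A minimal runnable sketch of the CSV-reading pattern; since no real file ships with these notes, it feeds pd.read_csv an in-memory text buffer (read_csv accepts file-like objects as well as filenames):

```python
import io

import pandas as pd

# Simulated file contents; in practice you would pass a filename
# such as pd.read_csv('tips.csv') after uploading it to the workspace.
csv_text = "product,price\nwidget,9.99\ngadget,19.99\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 2)
```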

First look at the data

  • Print a sample of rows to understand structure without printing everything:

    • df.head() returns the first 5 rows by default; df.head(n) returns the first n rows.

    • For broader sampling, use df.sample(n=3) to get a random subset of rows.

  • Check the dataset shape to understand size: df.shape yields a tuple like (244, 7).

  • Inspect columns and data types to anticipate analysis pitfalls (e.g., IDs stored as long integers or strings):

    • df.columns to see column names

    • df.dtypes to see each column's type, e.g., object, int64, float64

    • If a date column is read as object, convert it later using pd.to_datetime.

  • Descriptive statistics and data quality:

    • df.describe() gives quick statistics for numeric columns (count, mean, std, min, max, quartiles).

    • df.info() provides non-null counts and dtypes for all columns.

    • Example observation: a categorical column like state may show counts per category via df['state'].value_counts().

  • Practical note: for large datasets, printing the entire frame is impractical; rely on head(), sample(), and shape to understand content.
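
The head/sample/shape workflow can be sketched on a hypothetical 100-row frame (random_state is set only so the sample is reproducible):

```python
import pandas as pd

# Hypothetical dataset too large to print in full.
df = pd.DataFrame({"x": range(100), "y": range(100, 200)})

print(df.head())                        # first 5 rows by default
print(df.head(3))                       # first 3 rows
print(df.sample(n=3, random_state=0))   # random 3-row subset
print(df.shape)                         # (100, 2)
```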

Accessing and selecting data

  • Access a single column like a dictionary: df['price'] yields a Series. Attribute access (df.price) also works, but the bracket form is safer because it handles column names containing spaces and names that clash with DataFrame methods.

  • Select multiple columns: df[['price', 'quantity_sold']] returns a new DataFrame with those columns.

  • Indexing vs slicing:

    • df.loc is label-based (row labels and column names).

    • df.iloc is position-based (row/column indices).

  • Examples:

    • First row: df.iloc[0] or df.loc[df.index[0]]

    • Specific cell: df.loc[0, 'price'] or df.iloc[0, 1] (depending on column order)

    • Slice rows: df.iloc[0:3] returns a DataFrame of the first 3 rows

    • Slice rows and specific columns: df.loc[0:2, ['price', 'state']] (note: label-based slices with loc include the end label, whereas iloc slices exclude the end position)

  • Boolean filtering (slicing with conditions):

    • Create a boolean mask, e.g., mask = df['tip'] >= 5; then df[mask] yields rows where the condition is True.

    • Combine multiple conditions: use parentheses and bitwise operators, e.g., mask = (df['time'] == 'Lunch') & (df['day'] == 'Friday'); then df.loc[mask] filters accordingly.

    • Example: filter for rows where time is lunch: df.loc[df['time'] == 'Lunch'].

  • Chaining multiple filters is common and preferred over explicit loops for performance.
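
The selection and filtering patterns above, sketched on hypothetical tips-style data (column names mirror the tips dataset; the values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68],
    "tip": [1.01, 1.66, 3.50, 5.31],
    "time": ["Lunch", "Dinner", "Dinner", "Lunch"],
})

first_row = df.iloc[0]             # position-based: first record
cell = df.loc[0, "tip"]            # label-based: row label 0, column 'tip'
lunch = df[df["time"] == "Lunch"]  # boolean mask filter

# Combined conditions need parentheses and & (not the keyword 'and').
big_tips = df[(df["time"] == "Lunch") & (df["tip"] >= 5)]
print(len(lunch), len(big_tips))   # 2 1
```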

Data types and casting

  • Pandas data types include: float, int, bool, object (strings), and date/time types (e.g., datetime64[ns]).

  • Reading data can infer types, but you may need to cast:

    • Cast entire columns: df['col'] = df['col'].astype(int) or astype(float) or astype(str)

    • For dates, convert with: pd.to_datetime(df['date_col'])

  • Why it matters: data type mis-matches (e.g., an identifier read as a numeric type) can lead to truncation or mis-sorting. Always sanity-check with df.dtypes.

  • Important distinction: a DataFrame is composed of Series; a single column access returns a Series, while the DataFrame remains a 2D structure.
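
A short sketch of the casting calls above, on hypothetical string-typed columns:

```python
import pandas as pd

# Hypothetical frame where everything was read in as strings (object dtype).
df = pd.DataFrame({
    "id": ["001", "002"],
    "amount": ["10", "20"],
    "date": ["2024-01-05", "2024-02-10"],
})

df["amount"] = df["amount"].astype(int)   # string -> integer
df["date"] = pd.to_datetime(df["date"])   # object -> datetime64[ns]
print(df.dtypes)                          # sanity-check the result
```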

Deriving new columns (vectorized operations)

  • A natural derived column example: compute revenue as price × quantity sold.

    • Revenue formula: revenue = price × quantity_sold

    • In pandas: df['revenue'] = df['price'] * df['quantity_sold']

  • Important performance note: avoid Python for-loops to operate over DataFrames; pandas vectorized operations are much faster and scalable.
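
The derived-column example above as a runnable sketch (prices and quantities are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0], "quantity_sold": [3, 2]})

# Vectorized element-wise multiply; no Python for-loop needed.
df["revenue"] = df["price"] * df["quantity_sold"]
print(df["revenue"].tolist())  # [30.0, 40.0]
```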

Sorting data

  • Sorting by a column (e.g., price) with descending order:

    • df = df.sort_values(by='price', ascending=False)

  • Default sort is ascending; to reverse, set ascending=False.
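
A minimal sketch of descending sort_values on hypothetical product prices:

```python
import pandas as pd

df = pd.DataFrame({"product": ["a", "b", "c"], "price": [5.0, 15.0, 10.0]})

# ascending=False reverses the default ascending order.
df = df.sort_values(by="price", ascending=False)
print(df["product"].tolist())  # ['b', 'c', 'a']
```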

Grouping and aggregation (GroupBy)

  • GroupBy concept: split the DataFrame into groups by one or more keys, perform an aggregate, and then (optionally) combine results.

    • Example intuition: group by a key and summarize data within each group.

    • After grouping by a column (or multiple columns), you apply an aggregation function on a column, e.g., sum, mean, max, min.

  • Typical syntax pattern (step-by-step):

    • Start with a DataFrame, anchor it, then call groupby with one or more key columns (string names or a list of names):

    • Single key: grouped = df.groupby('key')

    • Multiple keys: grouped = df.groupby(['key1', 'key2'])

    • Then select the column(s) to aggregate and apply an aggregation, e.g.:

    • summary = grouped['col_to_aggregate'].mean()

    • For multiple aggregations: summary = grouped['col_to_aggregate'].agg(['mean', 'sum', 'max'])

  • Example using the tips dataset: group by meal time and compute the average total bill or the average tip per time slot.

  • If grouping by multiple columns, you can group by a list and then perform aggregations on one or more numeric columns: e.g.,

    • grouped = df.groupby(['day', 'time'])[['total_bill', 'tip']].mean()

  • Notes:

    • GroupBy often results in an index based on the group keys. You can re-sort with sort_values if you want descending order by the aggregated value.

    • You can also compute multiple statistics at once by using agg with a list or a dict to map columns to aggregations.
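
The split-aggregate pattern above can be sketched on hypothetical tips-style data (the numbers are made up, chosen so the means are easy to check):

```python
import pandas as pd

df = pd.DataFrame({
    "time": ["Lunch", "Lunch", "Dinner", "Dinner"],
    "total_bill": [10.0, 20.0, 30.0, 50.0],
    "tip": [1.0, 3.0, 5.0, 7.0],
})

# Single key, single aggregate: average bill per meal time.
avg_bill = df.groupby("time")["total_bill"].mean()
print(avg_bill["Lunch"])  # 15.0

# Multiple columns, multiple aggregations at once via agg.
multi = df.groupby("time")[["total_bill", "tip"]].agg(["mean", "sum"])
print(multi)
```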

Practical example: wide vs long data and reshaping

  • Wide vs long representations are two common layouts for the same data:

    • Wide: many columns represent different variables (e.g., points, assists, rebounds for teams).

    • Long: a compact single column for values and another for variable names (melting).

  • Melting (to long): pd.melt(DataFrame, id_vars=[...], var_name='variable', value_name='value')

    • Result: a long format with two new columns: variable (original column name) and value (the corresponding data).

    • Example on a tips dataset: melt to convert categorical indicators (e.g., dinner/lunch) into a long form with a single value column.

  • Pivoting (to wide): DataFrame.pivot(index='index_col', columns='variable', values='value')

  • Pivot table (with aggregation): DataFrame.pivot_table(index='index_col', columns='variable', values='value', aggfunc='mean')

  • Practical GDP example: a country-by-year dataset stored in a wide format (years across columns) can be melted to long, making it easier to analyze with grouping/aggregation.
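
The wide-to-long-and-back round trip can be sketched on a tiny hypothetical GDP-style table (countries "A"/"B" and the values are placeholders):

```python
import pandas as pd

# Wide format: one column per year.
wide = pd.DataFrame({
    "country": ["A", "B"],
    "2020": [1.0, 2.0],
    "2021": [1.5, 2.5],
})

# Melt to long: 'year' holds the old column names, 'gdp' the values.
long = pd.melt(wide, id_vars=["country"], var_name="year", value_name="gdp")
print(long.shape)  # (4, 3): 2 countries x 2 years, 3 columns

# Pivot back to wide using the same index/columns/values roles.
back = long.pivot(index="country", columns="year", values="gdp")
print(back.loc["A", "2021"])  # 1.5
```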

Slicing, indexing, and filtering recap

  • Loc vs iloc:

    • df.loc[row_label, column_label] uses labels; can pass slices and lists of labels.

    • df.iloc[row_index, column_index] uses integer positions.

  • Basic examples:

    • First row: df.iloc[0] (or df.loc[0] when the index is the default 0, 1, 2, … range)

    • Specific cells: df.loc[0, 'tip'] or df.iloc[0, 3]

    • Slice rows and columns: df.loc[0:3, ['tip', 'total_bill']]

  • Boolean indexing (filters):

    • Build a boolean mask, e.g., mask = df['tip'] >= 5; then df[mask].

    • Combine tests with logical operators using parentheses: mask = (df['time'] == 'Lunch') & (df['day'] == 'Friday'); then df.loc[mask].

  • Practical note: boolean indexing is the common, efficient way to filter data; for more complex filtering, build masks progressively and combine.

Data exploration workflow and tips

  • When starting with a new dataset, perform a quick survey:

    • Print the head to see column names and sample data: df.head().

    • Check shape: (rows, columns), e.g., (731, 16) in a DC bikes example.

    • Check data types first to anticipate parsing issues (e.g., dates as objects).

    • Use value_counts() on categorical columns to understand category distribution.

  • Typing awareness:

    • In pandas, strings are typically stored as object dtype.

    • You can cast columns to numeric types to enable better numeric analysis.

  • Performance note: prefer vectorized operations over Python loops for large datasets; loops tend to be slower and less scalable.

  • A few common pitfalls:

    • Large numeric IDs may be truncated if read as integers; consider reading as strings or using pd.to_numeric with careful handling.

    • Dates stored as objects can be converted to datetime types for easier extraction of year/month/day components.
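
Both pitfalls above can be sketched in one snippet; the large ID and the file contents are hypothetical, and an in-memory buffer stands in for a real CSV:

```python
import io

import pandas as pd

csv_text = "id,event_date\n12345678901234567890,2023-07-04\n"

# dtype={'id': str} keeps the oversized ID as a string instead of
# letting pandas try to parse it as an integer.
df = pd.read_csv(io.StringIO(csv_text), dtype={"id": str})

# Convert the object-typed date column, then pull out components via .dt.
df["event_date"] = pd.to_datetime(df["event_date"])
print(df["event_date"].dt.year.iloc[0])  # 2023
```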

Course logistics and assignments (summary)

  • Assignment 3 (R and Python): appears as a quiz-format activity with a final file upload requirement; due around two weeks from today.

  • There is also a semester-long project due in a few weeks; keep track of those deadlines.

  • Blackboard integration: ensure you can locate and submit assignments; some items may be temporarily hidden or moved.

  • Practical tip for notebooks: you can prepare both R and Python cells; it’s often easiest to implement logic in one language first, then translate the syntax to the other.

Real-world relevance and practice strategies

  • Sampling rather than printing entire datasets helps you understand data quality and distribution without overwhelming output.

  • EDA best practices include using groupby, melt/pivot transformations, and boolean filtering to explore patterns, relationships, and anomalies.

  • When you encounter new functions, Google is your friend for pandas equivalents and usage examples; common methods to recall include head, shape, info, describe, value_counts, groupby, agg, pivot, and melt.

Quick reference formulas and key notations

  • Revenue derivation (column-wise): revenue = price × quantity_sold

  • DataFrame shape example: shape = (244, 7)

  • Descriptive statistics for numeric columns include: count, mean, std, min, 25% (Q1), 50% (median), 75% (Q3), max

  • GroupBy aggregation sketch: grouped = df.groupby(keys)[col].agg(['mean', 'sum'])

  • Melt (wide to long) concept: resulting columns include variable (original column names) and value (the entries); keep some identifiers in id_vars.

  • Pivot/pivot_table: reshape data with an index and columns built from other columns, optionally applying an aggregation function like mean.