Pandas Basics: DataFrames, Indexing, and Data Wrangling

Pandas Basics: Working with Tabular Data

  • Pandas is the go-to Python package for working with tabular data. It provides powerful data structures and operations for manipulating structured data.

  • The central data structure is the DataFrame, a 2D labeled data structure with columns (fields) and rows (records/observations). An optional index identifies rows; by default it is a simple range like 0, 1, 2, … but you can provide meaningful indices (e.g., product IDs).

  • Terminology:

    • Fields/columns: the named data columns in a dataset.

    • Records/observations: the rows in a dataset.

    • Index: a label for each row; ideally unique.

  • A DataFrame is built from many sources; common example shown uses a dictionary where keys become column names and values are lists of data, e.g.:

    • keys: "product", "price", "quantity_sold", "state"

    • values: lists of data for each column.

  • Import pattern:

    • Basic import: import pandas as pd

    • Alias shortens subsequent calls (e.g., pd instead of pandas).
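
The dictionary-to-DataFrame pattern above can be sketched as follows; the product/price/quantity_sold/state values here are hypothetical placeholder data:

```python
import pandas as pd

# Hypothetical sales data; dict keys become column names,
# and each list becomes a column of values.
data = {
    "product": ["widget", "gadget", "gizmo"],
    "price": [9.99, 19.99, 4.99],
    "quantity_sold": [120, 45, 300],
    "state": ["CA", "NY", "TX"],
}

df = pd.DataFrame(data)
print(df.shape)  # (3, 4): 3 records, 4 fields
```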

Core concepts: DataFrame, Series, and index

  • A DataFrame is made up of Series; indexing or slicing a DataFrame by a single column yields a Series, not a DataFrame.

  • A Series is the 1D labeled array that underlies a DataFrame column.

  • DataFrames carry a shape: shape = (number of rows, number of columns). For example, a dataset with 244 rows and 7 columns has shape (244, 7).

  • Typical initial properties to inspect:

    • Columns: df.columns

    • Data types: df.dtypes (or df[col].dtype for a single column)

    • Non-null counts and memory: df.info()

    • Quick numeric summaries: df.describe() (numeric columns by default)

    • Column-wise counts by category: df['state'].value_counts()

  • For a practical example, a CSV dataset read into pandas might contain 244 records and 7 columns (e.g., a tips dataset). The value counts help identify the distribution of categorical columns like state.
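
The inspection calls above can be tried on a small hypothetical DataFrame (a stand-in for the tips/sales data, since the actual CSV is not reproduced here):

```python
import pandas as pd

# Hypothetical data to inspect; 'state' is a categorical-like column.
df = pd.DataFrame({
    "product": ["widget", "gadget", "gizmo", "widget"],
    "price": [9.99, 19.99, 4.99, 9.99],
    "state": ["CA", "CA", "NY", "TX"],
})

print(df.dtypes)                   # price is float64; product/state are object
print(df.describe())               # numeric summary covers 'price' only
print(df["state"].value_counts())  # counts per category; CA appears twice
```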

Quick start: creating and loading data

  • Create a simple DataFrame from a dictionary (columns become keys):

    • Example: keys become column names, values are lists of data.

    • Code style example (conceptual): pd.DataFrame(data) where data is a dict.

  • Reading data from sources:

    • CSV: pd.read_csv('filename.csv')

    • Excel: pd.read_excel('filename.xlsx')

    • JSON: pd.read_json('filename.json')

    • Parquet: pd.read_parquet('filename.parquet'), a columnar format optimized for large datasets; useful for big data; supports partitioning.

  • In environments like Google Colab, you typically upload the file into the Colab workspace (e.g., via the Files pane) and reference the filename directly: pd.read_csv('tips.csv'). A sample dataset might be described as having, for instance, 244 records and 7 columns.
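
A minimal runnable sketch of the CSV-reading pattern; since no real file ships with these notes, it feeds pd.read_csv an in-memory text buffer (read_csv accepts file-like objects as well as filenames):

```python
import io

import pandas as pd

# Simulated file contents; in practice you would pass a filename
# such as pd.read_csv('tips.csv') after uploading it to the workspace.
csv_text = "product,price\nwidget,9.99\ngadget,19.99\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 2)
```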

First look at the data

  • Print a sample of rows to understand structure without printing everything:

    • df.head() returns the first 5 rows by default; df.head(n) returns the first n rows.

    • For broader sampling, use df.sample(n=3) to get a random subset of rows.

  • Check the dataset shape to understand size: df.shape yields a tuple like (244, 7).

  • Inspect columns and data types to anticipate analysis pitfalls (e.g., IDs stored as long integers or strings):

    • df.columns to see column names

    • df.dtypes to see each column's type, e.g., object, int64, float64

    • If a date column is read as object, convert it later using pd.to_datetime.

  • Descriptive statistics and data quality:

    • df.describe() gives quick statistics for numeric columns (count, mean, std, min, max, quartiles).

    • df.info() provides non-null counts and dtypes for all columns.

    • Example observation: a categorical column like state may show counts per category via df['state'].value_counts().

  • Practical note: for large datasets, printing the entire frame is impractical; rely on head(), sample(), and shape to understand content.
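
The head/sample/shape workflow can be sketched on a hypothetical 100-row frame (random_state is set only so the sample is reproducible):

```python
import pandas as pd

# Hypothetical dataset too large to print in full.
df = pd.DataFrame({"x": range(100), "y": range(100, 200)})

print(df.head())                        # first 5 rows by default
print(df.head(3))                       # first 3 rows
print(df.sample(n=3, random_state=0))   # random 3-row subset
print(df.shape)                         # (100, 2)
```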

Accessing and selecting data

  • Access a single column like a dictionary: df['price'] yields a Series. Attribute access (df.price) also works, but the bracket form is safer because it handles column names containing spaces and names that clash with DataFrame methods.

  • Select multiple columns: df[['price', 'quantity_sold']] returns a new DataFrame with those columns.

  • Indexing vs slicing:

    • df.loc is label-based (row labels and column names).

    • df.iloc is position-based (row/column indices).

  • Examples:

    • First row: df.iloc[0] or df.loc[df.index[0]]

    • Specific cell: df.loc[0, 'price'] or df.iloc[0, 1] (depending on column order)

    • Slice rows: df.iloc[0:3] returns a DataFrame of the first 3 rows

    • Slice rows and specific columns: df.loc[0:2, ['price', 'state']] (note: label-based slices with loc include the end label, whereas iloc slices exclude the end position)

  • Boolean filtering (slicing with conditions):

    • Create a boolean mask, e.g., mask = df['tip'] >= 5; then df[mask] yields rows where the condition is True.

    • Combine multiple conditions: use parentheses and bitwise operators, e.g., mask = (df['time'] == 'Lunch') & (df['day'] == 'Friday'); then df.loc[mask] filters accordingly.

    • Example: filter for rows where time is lunch: df.loc[df['time'] == 'Lunch'].

  • Chaining multiple filters is common and preferred over explicit loops for performance.
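
The selection and filtering patterns above, sketched on hypothetical tips-style data (column names mirror the tips dataset; the values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68],
    "tip": [1.01, 1.66, 3.50, 5.31],
    "time": ["Lunch", "Dinner", "Dinner", "Lunch"],
})

first_row = df.iloc[0]             # position-based: first record
cell = df.loc[0, "tip"]            # label-based: row label 0, column 'tip'
lunch = df[df["time"] == "Lunch"]  # boolean mask filter

# Combined conditions need parentheses and & (not the keyword 'and').
big_tips = df[(df["time"] == "Lunch") & (df["tip"] >= 5)]
print(len(lunch), len(big_tips))   # 2 1
```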

Data types and casting

  • Pandas data types include: float, int, bool, object (strings), and date/time types (e.g., datetime64[ns]).

  • Reading data can infer types, but you may need to cast:

    • Cast entire columns: df['col'] = df['col'].astype(int) or astype(float) or astype(str)

    • For dates, convert with: pd.to_datetime(df['date_col'])

  • Why it matters: data type mis-matches (e.g., an identifier read as a numeric type) can lead to truncation or mis-sorting. Always sanity-check with df.dtypes.

  • Important distinction: a DataFrame is composed of Series; a single column access returns a Series, while the DataFrame remains a 2D structure.
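
A short sketch of the casting calls above, on hypothetical string-typed columns:

```python
import pandas as pd

# Hypothetical frame where everything was read in as strings (object dtype).
df = pd.DataFrame({
    "id": ["001", "002"],
    "amount": ["10", "20"],
    "date": ["2024-01-05", "2024-02-10"],
})

df["amount"] = df["amount"].astype(int)   # string -> integer
df["date"] = pd.to_datetime(df["date"])   # object -> datetime64[ns]
print(df.dtypes)                          # sanity-check the result
```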

Deriving new columns (vectorized operations)

  • A natural derived column example: compute revenue as price × quantity sold.

    • Revenue formula: revenue = price × quantity_sold

    • In pandas: df['revenue'] = df['price'] * df['quantity_sold']

  • Important performance note: avoid Python for-loops to operate over DataFrames; pandas vectorized operations are much faster and scalable.
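
The derived-column example above as a runnable sketch (prices and quantities are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0], "quantity_sold": [3, 2]})

# Vectorized element-wise multiply; no Python for-loop needed.
df["revenue"] = df["price"] * df["quantity_sold"]
print(df["revenue"].tolist())  # [30.0, 40.0]
```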

Sorting data

  • Sorting by a column (e.g., price) with descending order:

    • df = df.sort_values(by='price', ascending=False)

  • Default sort is ascending; to reverse, set ascending=False.
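
A minimal sketch of descending sort_values on hypothetical product prices:

```python
import pandas as pd

df = pd.DataFrame({"product": ["a", "b", "c"], "price": [5.0, 15.0, 10.0]})

# ascending=False reverses the default ascending order.
df = df.sort_values(by="price", ascending=False)
print(df["product"].tolist())  # ['b', 'c', 'a']
```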

Grouping and aggregation (GroupBy)

  • GroupBy concept: split the DataFrame into groups by one or more keys, perform an aggregate, and then (optionally) combine results.

    • Example intuition: group by a key and summarize data within each group.

    • After grouping by a column (or multiple columns), you apply an aggregation function on a column, e.g., sum, mean, max, min.

  • Typical syntax pattern (step-by-step):

    • Start with a DataFrame, anchor it, then call groupby with one or more key columns (string names or a list of names):

    • Single key: grouped = df.groupby('key')

    • Multiple keys: grouped = df.groupby(['key1', 'key2'])

    • Then select the column(s) to aggregate and apply an aggregation, e.g.:

    • summary = grouped['col_to_aggregate'].mean()

    • For multiple aggregations: summary = grouped['col_to_aggregate'].agg(['mean', 'sum', 'max'])

  • Example using the tips dataset: group by meal time and compute the average total bill or the average tip per time slot.

  • If grouping by multiple columns, you can group by a list and then perform aggregations on one or more numeric columns: e.g.,

    • grouped = df.groupby(['day', 'time'])[['total_bill', 'tip']].mean()

  • Notes:

    • GroupBy often results in an index based on the group keys. You can re-sort with sort_values if you want descending order by the aggregated value.

    • You can also compute multiple statistics at once by using agg with a list or a dict to map columns to aggregations.
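
The split-aggregate pattern above can be sketched on hypothetical tips-style data (the numbers are made up, chosen so the means are easy to check):

```python
import pandas as pd

df = pd.DataFrame({
    "time": ["Lunch", "Lunch", "Dinner", "Dinner"],
    "total_bill": [10.0, 20.0, 30.0, 50.0],
    "tip": [1.0, 3.0, 5.0, 7.0],
})

# Single key, single aggregate: average bill per meal time.
avg_bill = df.groupby("time")["total_bill"].mean()
print(avg_bill["Lunch"])  # 15.0

# Multiple columns, multiple aggregations at once via agg.
multi = df.groupby("time")[["total_bill", "tip"]].agg(["mean", "sum"])
print(multi)
```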

Practical example: wide vs long data and reshaping

  • Wide vs long representations are two common layouts for the same data:

    • Wide: many columns represent different variables (e.g., points, assists, rebounds for teams).

    • Long: a compact single column for values and another for variable names (melting).

  • Melting (to long): pd.melt(DataFrame, id_vars=[...], var_name='variable', value_name='value')

    • Result: a long format with two new columns: variable (original column name) and value (the corresponding data).

    • Example on a tips dataset: melt to convert categorical indicators (e.g., dinner/lunch) into a long form with a single value column.

  • Pivoting (to wide): DataFrame.pivot(index='index_col', columns='variable', values='value')

  • Pivot table (with aggregation): DataFrame.pivot_table(index='index_col', columns='variable', values='value', aggfunc='mean')

  • Practical GDP example: a country-by-year dataset stored in a wide format (years across columns) can be melted to long, making it easier to analyze with grouping/aggregation.
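
The wide-to-long-and-back round trip can be sketched on a tiny hypothetical GDP-style table (countries "A"/"B" and the values are placeholders):

```python
import pandas as pd

# Wide format: one column per year.
wide = pd.DataFrame({
    "country": ["A", "B"],
    "2020": [1.0, 2.0],
    "2021": [1.5, 2.5],
})

# Melt to long: 'year' holds the old column names, 'gdp' the values.
long = pd.melt(wide, id_vars=["country"], var_name="year", value_name="gdp")
print(long.shape)  # (4, 3): 2 countries x 2 years, 3 columns

# Pivot back to wide using the same index/columns/values roles.
back = long.pivot(index="country", columns="year", values="gdp")
print(back.loc["A", "2021"])  # 1.5
```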

Slicing, indexing, and filtering recap

  • Loc vs iloc:

    • df.loc[row_label, column_label] uses labels; can pass slices and lists of labels.

    • df.iloc[row_index, column_index] uses integer positions.

  • Basic examples:

    • First row: df.iloc[0] (or df.loc[0] when the index is the default 0, 1, 2, … range)

    • Specific cells: df.loc[0, 'tip'] or df.iloc[0, 3]

    • Slice rows and columns: df.loc[0:3, ['tip', 'total_bill']]

  • Boolean indexing (filters):

    • Build a boolean mask, e.g., mask = df['tip'] >= 5; then df[mask].

    • Combine tests with logical operators using parentheses: mask = (df['time'] == 'Lunch') & (df['day'] == 'Friday'); then df.loc[mask].

  • Practical note: boolean indexing is the common, efficient way to filter data; for more complex filtering, build masks progressively and combine.

Data exploration workflow and tips

  • When starting with a new dataset, perform a quick survey:

    • Print the head to see column names and sample data: df.head().

    • Check shape: (rows, columns), e.g., (731, 16) in a DC bikes example.

    • Check data types first to anticipate parsing issues (e.g., dates as objects).

    • Use value_counts() on categorical columns to understand category distribution.

  • Typing awareness:

    • In pandas, strings are typically stored as object dtype.

    • You can cast columns to numeric types to enable better numeric analysis.

  • Performance note: prefer vectorized operations over Python loops for large datasets; loops tend to be slower and less scalable.

  • A few common pitfalls:

    • Large numeric IDs may be truncated if read as integers; consider reading as strings or using pd.to_numeric with careful handling.

    • Dates stored as objects can be converted to datetime types for easier extraction of year/month/day components.
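
Both pitfalls above can be sketched in one snippet; the large ID and the file contents are hypothetical, and an in-memory buffer stands in for a real CSV:

```python
import io

import pandas as pd

csv_text = "id,event_date\n12345678901234567890,2023-07-04\n"

# dtype={'id': str} keeps the oversized ID as a string instead of
# letting pandas try to parse it as an integer.
df = pd.read_csv(io.StringIO(csv_text), dtype={"id": str})

# Convert the object-typed date column, then pull out components via .dt.
df["event_date"] = pd.to_datetime(df["event_date"])
print(df["event_date"].dt.year.iloc[0])  # 2023
```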

Course logistics and assignments (summary)

  • Assignment 3 (R and Python): appears as a quiz-format activity with a final file upload requirement; due around two weeks from today.

  • There is also a semester-long project due in a few weeks; keep track of those deadlines.

  • Blackboard integration: ensure you can locate and submit assignments; some items may be temporarily hidden or moved.

  • Practical tip for notebooks: you can prepare both R and Python cells; it’s often easiest to implement logic in one language first, then translate the syntax to the other.

Real-world relevance and practice strategies

  • Sampling rather than printing entire datasets helps you understand data quality and distribution without overwhelming output.

  • EDA best practices include using groupby, melt/pivot transformations, and boolean filtering to explore patterns, relationships, and anomalies.

  • When you encounter new functions, Google is your friend for pandas equivalents and usage examples; common methods to recall include head, shape, info, describe, value_counts, groupby, agg, pivot, and melt.

Quick reference formulas and key notations

  • Revenue derivation (column-wise): revenue = price × quantity_sold

  • DataFrame shape example: shape = (244, 7)

  • Descriptive statistics for numeric columns include: count, mean, std, min, 25% (Q1), 50% (median), 75% (Q3), max

  • GroupBy aggregation sketch: grouped = df.groupby(keys)[col].agg(['mean', 'sum'])

  • Melt (wide to long) concept: resulting columns include variable (original column names) and value (the entries); keep some identifiers in id_vars.

  • Pivot/pivot_table: reshape data with an index and columns built from other columns, optionally applying an aggregation function like mean.