Pandas Basics: DataFrames, Indexing, and Data Wrangling
Pandas is the go-to Python package for working with tabular data, providing powerful data structures and operations for manipulating structured datasets.
The central data structure is the DataFrame, a 2D labeled data structure with columns (fields) and rows (records/observations). An optional index identifies rows; by default it is a simple range like 0, 1, 2, … but you can provide meaningful indices (e.g., product IDs).
Terminology:
Fields/columns: the named data columns in a dataset.
Records/observations: the rows in a dataset.
Index: a label for each row; ideally unique.
A DataFrame can be built from many sources; a common example uses a dictionary where the keys become column names and the values are lists of data, e.g., keys "product", "price", "quantity_sold", and "state", each mapped to a list of that column's values.
Import pattern:
Basic import: import pandas as pd. The alias shortens subsequent calls (e.g., pd instead of pandas).
Core concepts: DataFrame, Series, and index
A DataFrame is made up of Series; indexing or slicing a DataFrame by a single column yields a Series, not a DataFrame.
A Series is the 1D labeled array that underlies a DataFrame column.
DataFrames carry a shape: a tuple of (number of rows, number of columns). For example, a dataset with 244 rows and 7 columns has shape (244, 7).
Typical initial properties to inspect:
Columns: df.columns
Data types: df.dtypes (or df[col].dtype for a single column)
Non-null counts and memory: df.info()
Quick numeric summaries: df.describe() (numeric columns only)
Column-wise counts by category: df['state'].value_counts()
For a practical example, a CSV dataset read into pandas might contain 244 records and 7 columns (e.g., a tips dataset). Value counts help identify the distribution of categorical columns like state.
Quick start: creating and loading data
Create a simple DataFrame from a dictionary (keys become columns):
Keys become column names; values are lists of data.
Code style example (conceptual): pd.DataFrame(data) where data is a dict.
Reading data from sources:
CSV: pd.read_csv('filename.csv')
Excel: pd.read_excel('filename.xlsx')
JSON: pd.read_json('filename.json')
Parquet: a columnar format optimized for large datasets; useful for big data; supports partitioning.
In environments like Google Colab, you typically upload the file into the Colab workspace (e.g., via the Files pane) and reference the filename directly:
pd.read_csv('tips.csv'). A sample dataset might be described as having, for instance, 244 records and 7 columns.
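A hedged sketch of the reader calls (filenames are placeholders; read_excel needs an engine such as openpyxl installed, and read_parquet a backend such as pyarrow):

import pandas as pd

tips = pd.read_csv('tips.csv')          # comma-separated text
sales = pd.read_excel('sales.xlsx')     # Excel workbook
logs = pd.read_json('records.json')     # JSON records
big = pd.read_parquet('data.parquet')   # columnar format for large data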
First look at the data
Print a sample of rows to understand structure without printing everything:
df.head() prints the first 5 rows by default; df.head(n) prints the first n rows.
For broader sampling, use df.sample(n=3) to get a random subset of rows.
Check the dataset shape to understand size:
df.shape yields a tuple like (244, 7).
Inspect columns and data types to anticipate analysis pitfalls (e.g., IDs stored as long integers or strings):
df.columns to see column names; df.dtypes to see each column's type, e.g., object, int64, float64.
If a date column is read as object, convert it later using pd.to_datetime.
Descriptive statistics and data quality:
df.describe() gives quick statistics for numeric columns (count, mean, std, min, max, quartiles). df.info() provides non-null counts and dtypes for all columns.
Example observation: a categorical column like state may show counts per category via df['state'].value_counts().
Practical note: for large datasets, printing the entire frame is impractical; rely on head(), sample(), and shape to understand content.
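Putting the first-look steps together on the tips example (a sketch, assuming tips.csv is present and has a categorical state column as described above):

import pandas as pd

df = pd.read_csv('tips.csv')
print(df.head())                   # first 5 rows
print(df.sample(n=3))              # 3 random rows
print(df.shape)                    # e.g., (244, 7)
print(df.dtypes)                   # per-column types
df.info()                          # non-null counts and memory usage
print(df.describe())               # numeric summaries
print(df['state'].value_counts())  # category distribution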
Accessing and selecting data
Access a single column (fields) like a dictionary:
df['price'] yields a Series; you can also use df.loc or attribute access (e.g., df.price), but the bracket form is safer.
Select multiple columns: df[['price', 'quantity_sold']] returns a new DataFrame with those columns.
Indexing vs slicing: df.loc is label-based (row labels and column names); df.iloc is position-based (row/column indices).
Examples:
First row: df.iloc[0] or df.loc[df.index[0]]
Specific cell: df.loc[0, 'price'] or df.iloc[0, 1] (depending on column order)
Slice rows: df.iloc[0:3] returns a DataFrame of the first 3 rows
Slice rows and specific columns: df.loc[0:2, ['price', 'state']] (note: label-based slices with loc include the end label, while iloc excludes the end position)
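A short sketch of these selections (column names follow the running example; a default RangeIndex is assumed, so labels and positions coincide):

first_row = df.iloc[0]                  # first row by position (a Series)
cell = df.loc[0, 'price']               # row label 0, column 'price'
first_three = df.iloc[0:3]              # positions 0-2; end position excluded
sub = df.loc[0:2, ['price', 'state']]   # labels 0-2; end label included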
Boolean filtering (slicing with conditions):
Create a boolean mask, e.g., mask = df['tip'] >= 5; then df[mask] yields rows where the condition is True.
Combine multiple conditions with parentheses and bitwise operators, e.g., mask = (df['time'] == 'Lunch') & (df['day'] == 'Friday'); then df.loc[mask] filters accordingly.
Example: filter for rows where time is Lunch: df.loc[df['time'] == 'Lunch'].
Chaining multiple filters is common and preferred over explicit loops for performance.
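A boolean-filtering sketch using the tips columns mentioned above:

big_tips = df[df['tip'] >= 5]

# Combine conditions with & / | and parentheses (not Python's `and`/`or`).
mask = (df['time'] == 'Lunch') & (df['day'] == 'Friday')
friday_lunch = df.loc[mask]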
Data types and casting
Pandas data types include float, int, bool, object (strings), and date/time types (e.g., datetime64[ns]). Reading data can infer types, but you may need to cast:
Cast entire columns: df['col'] = df['col'].astype(int) (or astype(float), astype(str)).
For dates, convert with pd.to_datetime(df['date_col']).
Why it matters: data type mismatches (e.g., an identifier read as a numeric type) can lead to truncation or mis-sorting. Always sanity-check with df.dtypes.
Important distinction: a DataFrame is composed of Series; a single column access returns a Series, while the DataFrame remains a 2D structure.
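A casting sketch (the column names here are hypothetical illustrations):

df['quantity_sold'] = df['quantity_sold'].astype(int)
df['price'] = df['price'].astype(float)
df['product'] = df['product'].astype(str)

# A date column read as object can be parsed into datetime64[ns]
# ('order_date' is a hypothetical column name):
df['order_date'] = pd.to_datetime(df['order_date'])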
Deriving new columns (vectorized operations)
A natural derived column example: compute revenue as price × quantity sold.
Revenue formula: revenue = price × quantity_sold
In pandas:
df['revenue'] = df['price'] * df['quantity_sold']
Important performance note: avoid Python for-loops to operate over DataFrames; pandas vectorized operations are much faster and scalable.
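The same derived column two ways, to make the contrast concrete (the loop version is shown only as the anti-pattern to avoid):

# Vectorized: one call operating on whole columns at once.
df['revenue'] = df['price'] * df['quantity_sold']

# Anti-pattern: a row-by-row Python loop is far slower on large frames.
# revenue = []
# for _, row in df.iterrows():
#     revenue.append(row['price'] * row['quantity_sold'])
# df['revenue'] = revenue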
Sorting data
Sorting by a column (e.g., price) with descending order:
df = df.sort_values(by='price', ascending=False)
The default sort is ascending; to reverse it, set ascending=False.
Grouping and aggregation (GroupBy)
GroupBy concept: split the DataFrame into groups by one or more keys, perform an aggregate, and then (optionally) combine results.
Example intuition: group by a key and summarize data within each group.
After grouping by a column (or multiple columns), you apply an aggregation function on a column, e.g., sum, mean, max, min.
Typical syntax pattern (step-by-step):
Start with the DataFrame, then call groupby with one or more key columns (string names or a list of names):
Single key: grouped = df.groupby('key')
Multiple keys: grouped = df.groupby(['key1', 'key2'])
Then select the column(s) to aggregate and apply an aggregation, e.g., summary = grouped['col_to_aggregate'].mean()
For multiple aggregations: summary = grouped['col_to_aggregate'].agg(['mean', 'sum', 'max'])
Example using the tips dataset: group by meal time and compute the average total bill or the average tip per time slot.
If grouping by multiple columns, group by a list and then aggregate one or more numeric columns, e.g., grouped = df.groupby(['day', 'time'])[['total_bill', 'tip']].mean()
Notes:
GroupBy often results in an index based on the group keys. You can re-sort with sort_values if you want descending order by the aggregated value.
You can also compute multiple statistics at once by using agg with a list, or a dict mapping columns to aggregations.
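A groupby sketch on the tips columns named above:

avg_bill_by_time = df.groupby('time')['total_bill'].mean()

# Multiple keys, multiple columns, multiple aggregations at once:
summary = df.groupby(['day', 'time'])[['total_bill', 'tip']].agg(['mean', 'sum'])

# Re-sort by the aggregated value for a descending ranking:
top_days = df.groupby('day')['tip'].mean().sort_values(ascending=False)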
Practical example: wide vs long data and reshaping
Wide vs long representations are two common layouts for the same data:
Wide: many columns represent different variables (e.g., points, assists, rebounds for teams).
Long: a compact single column for values and another for variable names (melting).
Melting (to long): pd.melt(df, id_vars=[...], var_name='variable', value_name='value')
Result: a long format with two new columns: variable (the original column name) and value (the corresponding data).
Example on a tips dataset: melt to convert categorical indicators (e.g., dinner/lunch) into a long form with a single value column.
Pivoting (to wide): df.pivot(index='index_col', columns='variable', values='value')
Pivot table (with aggregation): df.pivot_table(index='index_col', columns='variable', values='value', aggfunc='mean')
Practical GDP example: a country-by-year dataset stored in wide format (years across columns) can be melted to long, making it easier to analyze with grouping/aggregation.
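A reshape sketch in the spirit of the GDP example (countries and numbers are made up):

import pandas as pd

wide = pd.DataFrame({
    'country': ['USA', 'Canada'],
    '2021': [23.3, 2.0],
    '2022': [25.5, 2.1],
})

# Wide -> long: one row per (country, year) pair.
long_df = pd.melt(wide, id_vars=['country'], var_name='year', value_name='gdp')

# Long -> wide again; pivot_table can also aggregate duplicate entries.
back = long_df.pivot_table(index='country', columns='year', values='gdp', aggfunc='mean')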
Slicing, indexing, and filtering recap
Loc vs iloc:
df.loc[row_label, column_label] uses labels; it can take slices and lists of labels.
df.iloc[row_index, column_index] uses integer positions.
Basic examples:
First row: df.loc[0] or df.iloc[0]
Specific cells: df.loc[0, 'tip'] or df.iloc[0, 3]
Slice rows and columns: df.loc[0:3, ['tip', 'total_bill']]
Boolean indexing (filters):
Build a boolean mask, e.g., mask = df['tip'] >= 5; then df[mask].
Combine tests with logical operators and parentheses: mask = (df['time'] == 'Lunch') & (df['day'] == 'Friday'); then df.loc[mask].
Practical note: boolean indexing is the common, efficient way to filter data; for more complex filtering, build masks progressively and combine.
Data exploration workflow and tips
When starting with a new dataset, perform a quick survey:
Print the head to see column names and sample data: df.head().
Check shape: (rows, columns), e.g., (731, 16) in a DC bikes example.
Check data types first to anticipate parsing issues (e.g., dates as objects).
Use value_counts() on categorical columns to understand category distribution.
Typing awareness:
In pandas, strings are typically stored as the object dtype.
You can cast columns to numeric types to enable better numeric analysis.
Performance note: prefer vectorized operations over Python loops for large datasets; loops tend to be slower and less scalable.
A few common pitfalls:
Large numeric IDs may be truncated if read as integers; consider reading them as strings or using pd.to_numeric with careful handling.
Dates stored as objects can be converted to datetime types for easier extraction of year/month/day components.
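Sketches of both fixes (customer_id and date are hypothetical column names, not part of the tips example):

# Keep large IDs as strings so they are never truncated or reformatted:
df = pd.read_csv('tips.csv', dtype={'customer_id': str})

# Parse an object-dtype date column, then extract components:
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month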
Course logistics and assignments (summary)
Assignment 3 (R and Python): appears as a quiz-format activity with a final file upload requirement; due around two weeks from today.
There is also a semester-long project due in a few weeks; keep track of those deadlines.
Blackboard integration: ensure you can locate and submit assignments; some items may be temporarily hidden or moved.
Practical tip for notebooks: you can prepare both R and Python cells; it’s often easiest to implement logic in one language first, then translate the syntax to the other.
Real-world relevance and practice strategies
Sampling rather than printing entire datasets helps you understand data quality and distribution without overwhelming output.
EDA best practices include using groupby, melt/pivot transformations, and boolean filtering to explore patterns, relationships, and anomalies.
When you encounter new functions, Google is your friend for pandas equivalents and usage examples; common methods to recall include head, shape, info, describe, value_counts, groupby, agg, pivot, and melt.
Quick reference formulas and key notations
Revenue derivation (column-wise): revenue = price × quantity_sold
DataFrame shape example: shape = (244, 7)
Descriptive statistics for numeric columns include: count, mean, std, min, 25% (Q1), 50% (median), 75% (Q3), max
GroupBy aggregation sketch: grouped = df.groupby(keys)[col].agg(['mean', 'sum'])
Melt (wide to long) concept: resulting columns include variable (the original column names) and value (the entries); keep identifiers in id_vars.
Pivot/pivot_table: reshape data with an index and columns built from other columns, optionally applying an aggregation function like mean.