Data Manipulation with Python and Pandas
Introduction to Data Manipulation
- Course Title: OMIS 324: Business Analytics with Python
- Instructor: Ying Wang, Ph.D. Associate Professor of MIS
- Institution: Northern Illinois University, College of Business
Data Exploration
- Focuses on using Python and the Pandas library for data manipulation and analysis.
Filtering Data
Basic Concepts
- Python Keywords for Filtering:
& : Logical AND operator| : Logical OR operator~ : Logical NOT operator
Examples of Filtering Data
- Display employees whose Age is greater than 40:
df[df['Age'] > 40]
- Display female employees with an income greater than or equal to $100,000:
df[(df['Gender'] == 'Female') & (df['Income'] >= 100000)]
Logical Operators
- OR Operator (
|):- Example of combining filters:
df[(df['Age'] < 40) | (df['JobLevel'] == 'L3')]
- AND Operator (
&):- When combining filters, parentheses are necessary for correct evaluation:
df[(df['Age'] >= 40) | (df['JobLevel'] == 'L3') & (df['Region'] != 'West')]
Sorting Data
Sorting Functions
- Method:
sort_values() for sorting data in either ascending or descending order. - **Example of Sorting **:
- Sorted by Income:
python
df.sort_values('Income')
- Sorted by Age in descending order:
python
df.sort_values('Age', ascending=False)
Example Data for Sorting
| Name | Age | Gender | Income |
|---|
| Evelyn Wright | 23 | Female | 52000 |
| Jacob King | 46 | Male | 125000 |
| Amelia Hall | 25 | Female | 55000 |
| Ava Anderson | 45 | Female | 120000 |
| Emily Johnson | 24 | Female | 58000 |
| Charlotte Lewis | 42 | Female | 115000 |
| Isabella Moore | 26 | Female | 60000 |
| Daniel Brown | 41 | Male | 110000 |
| Olivia Davis | 27 | Female | 65000 |
| Ethan Allen | 39 | Male | 105000 |
| Mia Harris | 28 | Female | 70000 |
| William Thomas | 38 | Male | 102000 |
| Michael Smith | 29 | Male | 72000 |
| Alexander Clark | 36 | Male | 98000 |
Exercises: Filtering and Sorting Data
- Exercise 1.2: Filtering and Sorting on
StudentProfile DataFrame- Select students with a GPA > 3.5.
- Select students receiving Financial Aid (Y).
- Select students with a GPA > 3.5 and Major = Computer Science.
- Select students with a GPA > 3.8 or receiving Financial Aid.
- Select students with a GPA > 3.2 and not in the Arts & Sciences college, then sort by GPA (descending).
- Select students majoring in Computer Science or Data Analytics, then sort by GPA (descending) and Credits Earned (descending).
- Challenge: Among students receiving Financial Aid (Y), find the top 5 students by GPA.
Data Modification Techniques
Adding Columns
- To add a new column to the end of a DataFrame:
df['Bonus_Eligible'] = ['Y', 'N', 'Y', 'N', 'N', 'Y', 'Y', 'Y', 'Y', 'N', 'N', 'N', 'Y', 'Y', 'N', 'Y', 'N', 'Y', 'N', 'N']
- Display DataFrame using
head():
python
df.head()
Adding Columns in the Middle of a DataFrame
- Method to use:
.insert(location, column name, column value)- Step 1: Define the value.
- Step 2: Insert the value.
Adding Rows to DataFrame
- Use
pd.concat() to add rows:- Structure: Each column in a DataFrame must be a list even if there is only one row.
- Example syntax:
python
pd.concat([existing DataFrame, new rows], ignore_index=True)
- Use of
ignore_index=True: Ensures the index remains continuous.
- Example to create new rows:
NewEmployee = pd.DataFrame({
'ID': [21, 22],
'Name': ['John Smith', 'Alice Smith'],
'Age': [35, 33],
'Income': [60000, 50000],
'State': ['IL', 'IL'],
'Region': ['Midwest', 'Midwest'],
'YearsofExperience': [1, 2],
'JobLevel': ['L1', 'L1'],
'Bonus Eligible': ['N', 'N']
})
df = pd.concat([df, NewEmployee], ignore_index=True)
- Display Last 3 Rows of the DataFrame:
df.tail(3)
Updating Cell Values
- Method:
.at[row index, column name]- Example: Update the value for a specific cell:
python
df.at[row_index, 'column_name'] = new_value
Deleting Columns
- Use
.drop(column name, axis='columns') to delete:- Inplace Modification: Modify the existing DataFrame instead of creating a new one:
python
df.drop('column_name', axis=1, inplace=True)
Deleting Rows
- Use
.drop(row index, axis='rows') to delete:- Inplace Modification: Modify the existing DataFrame instead of creating a new one:
python
df.drop(row_index, axis=0, inplace=True)
Data Import and Correction Example
- Importing a CSV file: Use
pd.read_csv() method. - Correct entry for Joe Biden's Party: Assign value to the corresponding cell in DataFrame.
- Adding Information to DataFrame: Create a new row for Presidents' profile.
Summary Functions and Typical Syntax
- Methods to modify DataFrames described by Typical Syntax:
- To add a new column to the end:
python
df['NewCol'] = values
- To insert a specific position:
python
df.insert(loc, 'NewCol', value)
- To create a new DataFrame:
python
new_row = pd.DataFrame([{'ID':[21],'Name':['John']}])
- To append new rows:
python
df = pd.concat([df, new_rows], ignore_index=True)
- To update a single cell value:
python
df.at[row, 'col'] = value
- To delete column(s):
python
df.drop(labels, axis=1, inplace=True)
- To delete row(s):
python
df.drop(row_index, axis=0, inplace=True)