lec 10

Data Manipulation with Python and Pandas

Introduction to Data Manipulation

Course Title: OMIS 324: Business Analytics with Python
Instructor: Ying Wang, Ph.D. Associate Professor of MIS
Institution: Northern Illinois University, College of Business

Data Exploration

Focuses on using Python and the Pandas library for data manipulation and analysis.

Filtering Data

Basic Concepts

Python Keywords for Filtering:
- & : Logical AND operator
- | : Logical OR operator
- ~ : Logical NOT operator

Examples of Filtering Data

Display employees whose Age is greater than 40:

  df[df['Age'] > 40]

Display female employees with an income greater than or equal to $100,000:

  df[(df['Gender'] == 'Female') & (df['Income'] >= 100000)]

Logical Operators

OR Operator (|):
- Example of combining filters:

  df[(df['Age'] < 40) | (df['JobLevel'] == 'L3')]

AND Operator (&):
- When combining filters, parentheses are necessary for correct evaluation:

  df[(df['Age'] >= 40) | (df['JobLevel'] == 'L3') & (df['Region'] != 'West')]

Sorting Data

Sorting Functions

Method: sort_values() for sorting data in either ascending or descending order.
**Example of Sorting **:
- Sorted by Income:
  python df.sort_values('Income')
- Sorted by Age in descending order:
  python df.sort_values('Age', ascending=False)

Example Data for Sorting

Name	Age	Gender	Income
Evelyn Wright	23	Female	52000
Jacob King	46	Male	125000
Amelia Hall	25	Female	55000
Ava Anderson	45	Female	120000
Emily Johnson	24	Female	58000
Charlotte Lewis	42	Female	115000
Isabella Moore	26	Female	60000
Daniel Brown	41	Male	110000
Olivia Davis	27	Female	65000
Ethan Allen	39	Male	105000
Mia Harris	28	Female	70000
William Thomas	38	Male	102000
Michael Smith	29	Male	72000
Alexander Clark	36	Male	98000

Exercises: Filtering and Sorting Data

Exercise 1.2: Filtering and Sorting on StudentProfile DataFrame
1. Select students with a GPA > 3.5.
2. Select students receiving Financial Aid (Y).
3. Select students with a GPA > 3.5 and Major = Computer Science.
4. Select students with a GPA > 3.8 or receiving Financial Aid.
5. Select students with a GPA > 3.2 and not in the Arts & Sciences college, then sort by GPA (descending).
6. Select students majoring in Computer Science or Data Analytics, then sort by GPA (descending) and Credits Earned (descending).
Challenge: Among students receiving Financial Aid (Y), find the top 5 students by GPA.

Data Modification Techniques

Adding Columns

To add a new column to the end of a DataFrame:

  df['Bonus_Eligible'] = ['Y', 'N', 'Y', 'N', 'N', 'Y', 'Y', 'Y', 'Y', 'N', 'N', 'N', 'Y', 'Y', 'N', 'Y', 'N', 'Y', 'N', 'N']

Display DataFrame using head():
python df.head()

Adding Columns in the Middle of a DataFrame

Method to use: .insert(location, column name, column value)
- Step 1: Define the value.
- Step 2: Insert the value.

Adding Rows to DataFrame

Use pd.concat() to add rows:
- Structure: Each column in a DataFrame must be a list even if there is only one row.
- Example syntax:
  python pd.concat([existing DataFrame, new rows], ignore_index=True)
- Use of ignore_index=True: Ensures the index remains continuous.
Example to create new rows:

  NewEmployee = pd.DataFrame({
    'ID': [21, 22],
    'Name': ['John Smith', 'Alice Smith'],
    'Age': [35, 33],
    'Income': [60000, 50000],
    'State': ['IL', 'IL'],
    'Region': ['Midwest', 'Midwest'],
    'YearsofExperience': [1, 2],
    'JobLevel': ['L1', 'L1'],
    'Bonus Eligible': ['N', 'N']
  })
  df = pd.concat([df, NewEmployee], ignore_index=True)

Display Last 3 Rows of the DataFrame:

  df.tail(3)

Updating Cell Values

Method: .at[row index, column name]
- Example: Update the value for a specific cell:
  python df.at[row_index, 'column_name'] = new_value

Deleting Columns

Use .drop(column name, axis='columns') to delete:
- Inplace Modification: Modify the existing DataFrame instead of creating a new one:
  python df.drop('column_name', axis=1, inplace=True)

Deleting Rows

Use .drop(row index, axis='rows') to delete:
- Inplace Modification: Modify the existing DataFrame instead of creating a new one:
  python df.drop(row_index, axis=0, inplace=True)

Data Import and Correction Example

Importing a CSV file: Use pd.read_csv() method.
Correct entry for Joe Biden's Party: Assign value to the corresponding cell in DataFrame.
Adding Information to DataFrame: Create a new row for Presidents' profile.

Summary Functions and Typical Syntax

Methods to modify DataFrames described by Typical Syntax:
- To add a new column to the end:
  python df['NewCol'] = values
- To insert a specific position:
  python df.insert(loc, 'NewCol', value)
- To create a new DataFrame:
  python new_row = pd.DataFrame([{'ID':[21],'Name':['John']}])
- To append new rows:
  python df = pd.concat([df, new_rows], ignore_index=True)
- To update a single cell value:
  python df.at[row, 'col'] = value
- To delete column(s):
  python df.drop(labels, axis=1, inplace=True)
- To delete row(s):
  python df.drop(row_index, axis=0, inplace=True)