lec 10

Data Manipulation with Python and Pandas

Introduction to Data Manipulation

  • Course Title: OMIS 324: Business Analytics with Python
  • Instructor: Ying Wang, Ph.D. Associate Professor of MIS
  • Institution: Northern Illinois University, College of Business

Data Exploration

  • Focuses on using Python and the Pandas library for data manipulation and analysis.

Filtering Data

Basic Concepts

  • Python Keywords for Filtering:
    • & : Logical AND operator
    • | : Logical OR operator
    • ~ : Logical NOT operator

Examples of Filtering Data

  • Display employees whose Age is greater than 40:
  df[df['Age'] > 40]
  • Display female employees with an income greater than or equal to $100,000:
  df[(df['Gender'] == 'Female') & (df['Income'] >= 100000)]

Logical Operators

  • OR Operator (|):
    • Example of combining filters:
  df[(df['Age'] < 40) | (df['JobLevel'] == 'L3')]
  • AND Operator (&):
    • When combining filters, parentheses are necessary for correct evaluation:
  df[(df['Age'] >= 40) | (df['JobLevel'] == 'L3') & (df['Region'] != 'West')]

Sorting Data

Sorting Functions

  • Method: sort_values() for sorting data in either ascending or descending order.
  • **Example of Sorting **:
    • Sorted by Income:
      python df.sort_values('Income')
    • Sorted by Age in descending order:
      python df.sort_values('Age', ascending=False)

Example Data for Sorting

NameAgeGenderIncome
Evelyn Wright23Female52000
Jacob King46Male125000
Amelia Hall25Female55000
Ava Anderson45Female120000
Emily Johnson24Female58000
Charlotte Lewis42Female115000
Isabella Moore26Female60000
Daniel Brown41Male110000
Olivia Davis27Female65000
Ethan Allen39Male105000
Mia Harris28Female70000
William Thomas38Male102000
Michael Smith29Male72000
Alexander Clark36Male98000

Exercises: Filtering and Sorting Data

  • Exercise 1.2: Filtering and Sorting on StudentProfile DataFrame
    1. Select students with a GPA > 3.5.
    2. Select students receiving Financial Aid (Y).
    3. Select students with a GPA > 3.5 and Major = Computer Science.
    4. Select students with a GPA > 3.8 or receiving Financial Aid.
    5. Select students with a GPA > 3.2 and not in the Arts & Sciences college, then sort by GPA (descending).
    6. Select students majoring in Computer Science or Data Analytics, then sort by GPA (descending) and Credits Earned (descending).
  • Challenge: Among students receiving Financial Aid (Y), find the top 5 students by GPA.

Data Modification Techniques

Adding Columns

  • To add a new column to the end of a DataFrame:
  df['Bonus_Eligible'] = ['Y', 'N', 'Y', 'N', 'N', 'Y', 'Y', 'Y', 'Y', 'N', 'N', 'N', 'Y', 'Y', 'N', 'Y', 'N', 'Y', 'N', 'N']
  • Display DataFrame using head():
    python df.head()

Adding Columns in the Middle of a DataFrame

  • Method to use: .insert(location, column name, column value)
    • Step 1: Define the value.
    • Step 2: Insert the value.

Adding Rows to DataFrame

  • Use pd.concat() to add rows:
    • Structure: Each column in a DataFrame must be a list even if there is only one row.
    • Example syntax:
      python pd.concat([existing DataFrame, new rows], ignore_index=True)
    • Use of ignore_index=True: Ensures the index remains continuous.
  • Example to create new rows:
  NewEmployee = pd.DataFrame({
    'ID': [21, 22],
    'Name': ['John Smith', 'Alice Smith'],
    'Age': [35, 33],
    'Income': [60000, 50000],
    'State': ['IL', 'IL'],
    'Region': ['Midwest', 'Midwest'],
    'YearsofExperience': [1, 2],
    'JobLevel': ['L1', 'L1'],
    'Bonus Eligible': ['N', 'N']
  })
  df = pd.concat([df, NewEmployee], ignore_index=True)
  • Display Last 3 Rows of the DataFrame:
  df.tail(3)

Updating Cell Values

  • Method: .at[row index, column name]
    • Example: Update the value for a specific cell:
      python df.at[row_index, 'column_name'] = new_value

Deleting Columns

  • Use .drop(column name, axis='columns') to delete:
    • Inplace Modification: Modify the existing DataFrame instead of creating a new one:
      python df.drop('column_name', axis=1, inplace=True)

Deleting Rows

  • Use .drop(row index, axis='rows') to delete:
    • Inplace Modification: Modify the existing DataFrame instead of creating a new one:
      python df.drop(row_index, axis=0, inplace=True)

Data Import and Correction Example

  • Importing a CSV file: Use pd.read_csv() method.
  • Correct entry for Joe Biden's Party: Assign value to the corresponding cell in DataFrame.
  • Adding Information to DataFrame: Create a new row for Presidents' profile.

Summary Functions and Typical Syntax

  • Methods to modify DataFrames described by Typical Syntax:
    • To add a new column to the end:
      python df['NewCol'] = values
    • To insert a specific position:
      python df.insert(loc, 'NewCol', value)
    • To create a new DataFrame:
      python new_row = pd.DataFrame([{'ID':[21],'Name':['John']}])
    • To append new rows:
      python df = pd.concat([df, new_rows], ignore_index=True)
    • To update a single cell value:
      python df.at[row, 'col'] = value
    • To delete column(s):
      python df.drop(labels, axis=1, inplace=True)
    • To delete row(s):
      python df.drop(row_index, axis=0, inplace=True)