JB

Reading and Writing Data with Python using Pandas

Reading and Writing Data with Python

Working Directory and File Paths

  • Purpose: To locate data for loading into Python.
  • Checking Current Working Directory:
    • Use the os module.
    • import os
    • os.getcwd() or os.get_current_working_directory() displays the current file system location.
  • Loading Data from Different Directories:
    • If files are not in the current working directory, the correct directory must be specified.
    • Example using os.chdir() to change the directory:
      • os.chdir('kaggle') (moves one level up from kaggle/working to kaggle).
    • os.listdir() or os.listdir() lists files and folders in a directory.
  • Example Scenario:
    • Navigating Kaggle environment directories:
      • Moving from kaggle/working to kaggle.
      • Listing directories to find the input folder.
      • Exploring input/titanic to find train.csv, gender_submission.csv, and test.csv.

Reading Comma Separated Values (CSV) and Tab Separated Values (TSV) Files

  • CSV and TSV Files:
    • Common flat file data formats.
    • Readable and editable in spreadsheet programs like Excel.
    • CSV: Data values separated by commas.
    • TSV: Data values separated by tabs.
  • Reading CSV Files with pandas:
    • Use pandas.read_csv() function.
    • Example:
      • import pandas as pd
      • pd.read_csv('input/titanic/train.csv')
      • Loads the Titanic training dataset into a pandas DataFrame.
      • .head(6) displays the first six rows of the DataFrame.
  • Reading TSV Files with pandas:
    • Use pd.read_table() function.
    • Default delimiter is tab.
    • For other delimiters, use the sep argument.
    • Example:
      • pd.read_table('file.tsv')
      • pd.read_table('file.txt', sep='-') (if data is separated by hyphens).
    • Specifying tab character:
      • Use '\t' to represent a tab character in the sep argument.

Reading Data from Excel Files

  • Reading Excel Files with pandas:
    • Use pd.read_excel() function.
    • Requires specifying the file path and sheet name.
    • Example:
      • draft = pd.read_excel('input/draft2015/draft.xlsx', sheet_name='draft2015')
      • Loads data from the 'draft2015' sheet in the Excel file.
      • .head() displays the first few rows of the DataFrame.

Reading Data from the Web

  • General Approach:
    • Download data in CSV or Excel format when possible.
    • Use pd.read_csv() or pd.read_excel() to load the downloaded file.
  • Reading Data from Clipboard:
    • Use pd.read_clipboard() function.
    • Copy data from a website or other source and paste it into the clipboard.
    • Example:
      • df = pd.read_clipboard()
    • May require cleaning and reformatting in a spreadsheet program before reading.
  • Reading HTML Tables:
    • Use pd.read_html() function.
    • Requires the html5lib package.
  • Handling Nested Data Structures (HTML, JSON, XML):
    • Requires more complex parsing and specialized libraries.

Writing Data with pandas

  • Writing Data to CSV Files:
    • Use the .to_csv() method on a DataFrame.
    • Example:
      • draft.to_csv('draft_save.csv')
      • Saves the draft DataFrame to a CSV file named 'draft_save.csv' in the current working directory.
  • Verifying the Output:
    • Use os.listdir() to confirm the file has been created.
  • Application:
    • Saving prediction results in machine learning competitions.

Conclusion

  • Pandas for Data Handling:
    • Simplifies reading and writing common data formats.
  • Databases and Nested Data:
    • Require specialized libraries and more complex handling.
  • Python's Versatility:
    • Libraries available for almost any data format.
    • Research and documentation are key to finding the right tools.
  • Next Steps:
    • Learning basic Python programming constructs and custom functions.
    • Control flow (if/else logic, for/while loops) for custom code execution.