Untitled Notes

Introduction

In this session, Anthony Clem introduces himself as a member of the Coast Survey Development Lab, part of NOAA's Office of Coast Survey. He has been involved with the Crowdsourced Bathymetry (CSB) Working Group for several years and focuses on designing tools for processing and visualizing crowdsourced bathymetry data. The topic of discussion is the extraction of raw comma-separated values (CSV) data from the IHO Data Centre for Digital Bathymetry (DCDB) into the lab's processing pipeline.

Overview of the Point Store vs. Submission Data Files

Anthony emphasizes the importance of understanding the difference between the point store and the actual submission data files. The point store is the source from which the processing pipeline extracts data. Users who need data for a particular location can use the web-based viewer and query tools to extract point data for their area of interest. A single geographic query can retrieve a substantial amount of data, up to 1,500,000 points.

Challenges with Large Data Sets

Processing crowdsourced bathymetry data for the entire coastline of the United States through a simple web tool is impractical for an automated pipeline. Rather than attempting a single enormous download, which may result in system failures, the pipeline segments the area geographically using a tessellation scheme.

Tessellation Scheme

The United States coastline is divided into smaller geospatial grid tiles. The grid is created in a geospatial data format such as GeoPackage, but it can also be produced in shapefile format. Working tile by tile prevents the system from being overwhelmed by the large volume of data.
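As a rough illustration of the tiling idea, the sketch below splits a bounding box into square tiles. It is not the grid used in the talk: the function name, the tile size, and the use of plain tuples instead of a GeoPackage or shapefile layer are all assumptions made for the example.

```python
import math
from typing import Iterator, NamedTuple


class Tile(NamedTuple):
    """One grid cell, in decimal degrees (hypothetical structure)."""
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float


def tessellate(min_lon: float, min_lat: float,
               max_lon: float, max_lat: float,
               size_deg: float) -> Iterator[Tile]:
    """Yield square tiles of size_deg covering the box, row by row from the south-west."""
    # Integer row/column counts avoid accumulating floating-point error.
    n_cols = math.ceil((max_lon - min_lon) / size_deg)
    n_rows = math.ceil((max_lat - min_lat) / size_deg)
    for row in range(n_rows):
        for col in range(n_cols):
            west = min_lon + col * size_deg
            south = min_lat + row * size_deg
            # Clip the last row/column so tiles never extend past the box.
            yield Tile(west, south,
                       min(west + size_deg, max_lon),
                       min(south + size_deg, max_lat))


# Example: a 2 x 1 degree box cut into half-degree tiles (illustrative values).
tiles = list(tessellate(-77.0, 36.0, -75.0, 37.0, 0.5))
```

A real grid would typically be clipped to the coastline and saved as a geospatial layer so that it can be inspected in a GIS.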

Considerations for Processing Data

There are various reasons for breaking the data spatially into different tiles for processing:

Tidal Correction: The data processing involves using discrete tidal zones, which necessitates querying specific tide station data tailored to the geographical area being processed.
System Capacity: Holding a very large number of data points in system memory at once is impractical, so geographic segmentation keeps each processing job to a manageable size.

Automated Download Process

Anthony describes a programmatic approach in which a script automates the data download. In essence, the following steps are performed:

A script is written around the tessellation scheme.
The script processes each tile in sequence, making requests to the DCDB database via the point store API.
Once the data for a tile has been downloaded, the script moves on to the next tile, producing one CSV file per geographic tile.
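The steps above can be sketched as a simple loop. This is a minimal illustration, not the CSV scraper itself: the request to the point store is represented by a caller-supplied `fetch_points` function because the real API call is not described in these notes, and the column names and the choice to skip empty tiles are assumptions.

```python
import csv
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (min_lon, min_lat, max_lon, max_lat)


def download_tiles(tiles: Dict[str, BBox],
                   fetch_points: Callable[[BBox], Iterable[Sequence[object]]],
                   out_dir: Path,
                   header: Sequence[str] = ("lon", "lat", "depth", "time")) -> List[Path]:
    """Fetch each tile in turn and write one CSV per tile that returned data."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for name, bbox in tiles.items():
        # fetch_points stands in for the real point store request (hypothetical).
        rows = list(fetch_points(bbox))
        if not rows:
            continue  # assumption: tiles with no points produce no file
        path = out_dir / f"{name}.csv"
        with path.open("w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(rows)
        written.append(path)
    return written
```

Only one tile's points are held in memory at a time, which is the reason given above for processing the coastline in pieces.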

Naming Convention

Each tile is systematically named to facilitate the identification of its spatial location. This approach proves beneficial for troubleshooting and data management in the future.
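The actual naming convention is not spelled out in these notes. As one hypothetical scheme, a tile could be named after its south-west corner, encoded as hemisphere letters plus hundredths of a degree:

```python
def tile_name(min_lon: float, min_lat: float) -> str:
    """Build a name from a tile's south-west corner (hypothetical convention)."""
    ns = "N" if min_lat >= 0 else "S"
    ew = "E" if min_lon >= 0 else "W"
    # Hundredths of a degree, zero-padded so names sort consistently.
    lat_h = round(abs(min_lat) * 100)
    lon_h = round(abs(min_lon) * 100)
    return f"{ns}{lat_h:04d}_{ew}{lon_h:05d}"


name = tile_name(-76.5, 36.0)  # "N3600_W07650"
```

With a scheme like this, a problem CSV file can be traced back to its location on the map from the filename alone.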

Incremental Data Pulls

In future data retrieval operations, the same script can be adjusted to pull only the data added since the last download, using the DCDB API's ability to accept date filters in its JSON payload. Users therefore avoid redownloading data that has already been processed.
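A minimal sketch of how such a request body might be built is shown below. The field names `bbox` and `start_date` and the timestamp format are placeholders invented for illustration; the real payload schema must be taken from the API documentation.

```python
import json
from datetime import datetime, timezone
from typing import Optional, Sequence


def build_query(bbox: Sequence[float], since: Optional[datetime] = None) -> str:
    """Return a JSON request body for one tile, optionally limited by date.

    The keys "bbox" and "start_date" are hypothetical, not the real schema.
    """
    payload = {"bbox": list(bbox)}
    if since is not None:
        # Pass a timezone-aware datetime; it is normalized to UTC here.
        payload["start_date"] = since.astimezone(timezone.utc).strftime(
            "%Y-%m-%dT%H:%M:%SZ")
    return json.dumps(payload, sort_keys=True)


last_run = datetime(2024, 3, 1, tzinfo=timezone.utc)  # illustrative date
body = build_query((-77.0, 36.0, -76.5, 36.5), since=last_run)
```

Recording the time of each successful run, for example in a small text file next to the CSV output, gives the next run its `since` value.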

CSV Scraper Tool

The script developed for this purpose is called the "CSV scraper". It is a Python-based tool included in the free Pydro distribution. It has no graphical user interface (GUI), so users modify and run it from an Integrated Development Environment (IDE) or the command line. Anthony offers limited support for questions about the tool's usage.

Questions and Answers

A question is asked about whether the spatial grid sizes align with the ENCs (Electronic Navigational Charts). Anthony clarifies that he initially intended to align his grid with an existing tessellation scheme (such as S-102), but found those grid cells too large geographically, so he designed smaller ones.

Conclusion

The discussion wraps up with thanks to the participants and a transition to the next presenter, who discusses use cases with Citco.