Data Quality and Standards Summary
Data Quality Parameters
Defining Data Quality
Quality assessment for data differs from that for manufactured goods because data quality depends on intangible properties such as 'completeness' and 'consistency'.
Data quality is crucial due to:
Increased data production by the private sector.
Increased use of GIS as a decision-support tool.
Increased reliance on secondary data sources.
Shift in responsibility: from producer-centered quality control to consumer-driven 'fitness-for-use'.
The producer's role is evolving towards data quality documentation or 'truth-in-labelling,' which acknowledges that error is inevitable and that misuse arises from incomplete knowledge of data limitations.
Data Quality Components
Geographical observations consist of spatial, temporal, and thematic components.
Space, time and theme influence various components, including accuracy, precision, consistency and completeness.
Accuracy
Accuracy is the inverse of error. Within the entity-attribute-value model, error is the discrepancy between the encoded and actual value of an attribute for a given entity.
Spatial, temporal, and thematic errors are the discrepancies in encoded spatial, temporal, and thematic attribute values, respectively.
Accuracy measurements across space, time, and theme are interdependent.
The definition of error presupposes an objective, external reality that exists and can be observed, an assumption that often cannot be met in practice.
Accuracy assessment is performed against a database 'specification,' which defines the required level of abstraction and generalisation relative to real-world phenomena.
Ultimately accuracy depends on the intended form and content of the database.
Spatial Accuracy
Spatial accuracy, or positional accuracy, concerns the accuracy of the spatial component of a database.
For point entities, error is the Euclidean distance between encoded and specified locations.
Common measures include horizontal error (in x and y) and vertical error (in z).
Metrics used to summarize spatial error:
Mean error: tends to zero when bias is absent.
Root Mean Squared Error (RMSE): the square root of the mean of the squared errors, commonly used for vertical accuracy in Digital Elevation Models (DEMs).
Horizontal error extends the classical error model, allowing statistical inference tests and confidence limits for point locations.
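The point-error measures above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the sample elevation differences are hypothetical, and errors are taken as encoded minus reference values.

```python
import math

def positional_errors(encoded, reference):
    """Euclidean distance between each encoded (x, y) and its reference (x, y)."""
    return [math.hypot(ex - rx, ey - ry)
            for (ex, ey), (rx, ry) in zip(encoded, reference)]

def mean_error(errors):
    """Signed mean of the errors; close to zero when there is no bias."""
    return sum(errors) / len(errors)

def rmse(errors):
    """Square root of the mean of the squared errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical vertical errors for four DEM check points, in metres.
dz = [0.5, -0.5, 1.0, -1.0]
print(mean_error(dz))  # positive and negative errors cancel out
print(rmse(dz))        # magnitude of error, regardless of sign
```

The sample shows why the two metrics are reported together: the mean error reveals bias, while RMSE reflects the magnitude of error even when positive and negative errors cancel.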
For lines, error is defined using variants of the epsilon band, representing a zone of uncertainty around an encoded line.
Early models depict a uniform 'sausage' shape.
Recent studies indicate non-uniform shapes and error distributions within the band.
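As a rough sketch of the uniform ('sausage') form of the epsilon band, the test below checks whether a point lies within a fixed distance of an encoded line. The function names, the coordinates, and the constant band width are assumptions made for illustration; the non-uniform bands mentioned above are not modelled.

```python
import math

def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment a-b (all (x, y) tuples)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:  # degenerate segment: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment, clamping the projection to the end points.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def within_epsilon_band(p, line, epsilon):
    """True if p lies within epsilon of any segment of the polyline."""
    return any(point_segment_distance(p, a, b) <= epsilon
               for a, b in zip(line, line[1:]))

encoded_line = [(0, 0), (10, 0), (10, 10)]  # hypothetical encoded line
print(within_epsilon_band((5, 1), encoded_line, epsilon=2))  # True
print(within_epsilon_band((5, 5), encoded_line, epsilon=2))  # False
```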
Temporal Accuracy
Temporal accuracy refers to the agreement between encoded and 'actual' temporal coordinates and not to currentness.
Currentness measures how up to date a database is, which is application-specific.
Assessment of temporal accuracy relies on objective time measurement using a standard temporal coordinate system, but no such standard is universally accepted.
Temporal information is often omitted in geospatial databases (except those for explicitly historical purposes).
Thematic Accuracy
Metrics of thematic accuracy vary with measurement scale.
For quantitative attributes, metrics are similar to those for spatial accuracy (e.g., RMSE).
For categorical data, research has been driven by classification accuracy assessment in remote sensing.
Assessment involves comparing land cover classes assigned by a classification procedure to those observed on a reference source.
A 'classification error matrix' is used for accuracy assessment.
Metrics derived from the error matrix:
Proportion correctly classified.
Kappa.
User's and producer's accuracies.
The matrix provides information on misclassification frequencies, errors of omission, and errors of commission.
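The metrics listed above can be computed from an error matrix as in the following sketch. It assumes a particular layout (rows are classes assigned by the classifier, columns are reference classes) and uses a hypothetical two-class matrix; the function name is illustrative. It also assumes every class has at least one sample in both its row and its column.

```python
def error_matrix_metrics(matrix):
    """matrix[i][j]: samples classified as class i whose reference class is j."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]  # totals per classified class
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]  # per reference class
    diagonal = [matrix[i][i] for i in range(k)]  # correctly classified samples

    pcc = sum(diagonal) / n  # proportion correctly classified
    # Agreement expected by chance, from the marginal totals.
    chance = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    kappa = (pcc - chance) / (1 - chance)
    # User's accuracy: 1 minus the commission error for each classified class.
    users = [d / r for d, r in zip(diagonal, row_totals)]
    # Producer's accuracy: 1 minus the omission error for each reference class.
    producers = [d / c for d, c in zip(diagonal, col_totals)]
    return {"pcc": pcc, "kappa": kappa, "users": users, "producers": producers}

# Hypothetical matrix for two land cover classes.
result = error_matrix_metrics([[40, 10],
                               [5, 45]])
print(result["pcc"])        # 0.85
print(result["users"])      # [0.8, 0.9]
print(result["producers"])  # 40/45 and 45/55
```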
Precision or Resolution
Precision, or resolution, refers to the level of detail that can be discerned.
All data have limited resolution due to measurement system limitations and intentional generalisation in geospatial databases.
Generalisation involves:
Elimination and merging of entities.
Reduction in detail.
Smoothing.
Thinning.
Aggregation of classes.
Resolution affects the suitability of a database for specific applications and must match the required level of detail.
Accuracy and resolution are generally inversely related.
Spatial Resolution
Spatial resolution in remote sensing is defined by the ground dimensions of pixels in a digital image.
This determines the minimum size of discernible objects.
For vector data, the smallest discernible feature is based on rules for minimum mapping unit size, dependent on map scale.
Spatial resolution differs from the spatial sampling rate.
Resolution refers to the fineness of observable detail.
The sampling rate defines the ability to resolve patterns over space.
Temporal Resolution
Temporal resolution refers to the minimum discernible duration of an event and depends on both the recording interval and the event's rate of change.
Events shorter than the sampling interval are generally unresolvable (the 'synopticity' problem).
Interactions between spatial and thematic resolution must also be considered.
Temporal sampling rate is the frequency of repeat coverage, whereas temporal resolution is the collection interval for each individual measurement.
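The effect of the sampling interval on short events can be illustrated with a small sketch. It assumes instantaneous observations at a fixed interval and treats an event as resolvable only if at least one observation falls within it; the function names and time values are hypothetical.

```python
def sample_times(start, end, interval):
    """Observation times from start to end (inclusive) at a fixed interval."""
    times = []
    t = start
    while t <= end:
        times.append(t)
        t += interval
    return times

def is_observed(event_start, event_end, times):
    """True if at least one observation falls inside the event's duration."""
    return any(event_start <= t <= event_end for t in times)

times = sample_times(0, 30, 10)     # observations at 0, 10, 20 and 30
print(is_observed(12, 15, times))   # False: the event falls between two observations
print(is_observed(8, 22, times))    # True: the event spans observations at 10 and 20
```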
Thematic Resolution
In the thematic domain, resolution depends on the measurement scale.
For quantitative data, it is determined by the precision of the measurement device.
For categorical data, resolution is defined by the fineness of category definitions.
Consistency
Consistency refers to the absence of contradictions in a database.
For geospatial data, it specifies conformance with topological rules, such as:
Only one point at a given location.
Lines must intersect at nodes.
Polygons are bounded by lines.
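A check for the first rule above (only one point at a given location) might look like the sketch below. The function name, the point identifiers, and the choice to compare coordinates after rounding to a fixed number of decimal places are assumptions for illustration, not a prescribed test.

```python
from collections import defaultdict

def duplicate_point_ids(points, ndigits=6):
    """Group ids of points that share a location after rounding coordinates."""
    by_location = defaultdict(list)
    for point_id, (x, y) in points.items():
        by_location[(round(x, ndigits), round(y, ndigits))].append(point_id)
    # Any location holding more than one id violates the rule.
    return [ids for ids in by_location.values() if len(ids) > 1]

# Hypothetical point layer keyed by id.
points = {"a": (1.0, 2.0), "b": (3.0, 4.0), "c": (1.0, 2.0)}
print(duplicate_point_ids(points))  # [['a', 'c']]
```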
Spatial inconsistencies can also be identified through redundancies in spatial attributes.
Little work has been done on consistency in the temporal domain, though a framework for temporal topology exists.
Thematic consistency requires a level of redundancy in thematic attributes.
The absence of inconsistencies does not guarantee accuracy.
Tests for thematic consistency are rare, despite the potential to exploit attribute redundancies.
Completeness
Completeness is the relationship between objects in the database and the 'abstract universe' of all such objects, influenced by selection criteria, definitions, and mapping rules.
Requires a precise description of the abstract universe.
Two types of completeness:
Data completeness: measurable error of omission between the database and the specification (application-independent).
Model completeness: agreement between the database specification and the abstract universe required for a specific application (application-dependent).
Additional distinctions:
Feature or entity completeness.
Attribute completeness: the degree to which all relevant attributes of a feature have been encoded.
Value completeness: the degree to which values are present for all attributes.
Completeness can be defined over space, time, or theme.
Completeness includes errors of omission and commission.
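Feature and value completeness can be sketched with simple set and count operations. The example assumes the abstract universe is available as a list of feature identifiers, and the choice of denominators for the two rates is one convention among several; all names and sample data are hypothetical.

```python
def feature_completeness(database_ids, universe_ids):
    """Compare feature ids in the database with those in the abstract universe."""
    database_ids, universe_ids = set(database_ids), set(universe_ids)
    omissions = universe_ids - database_ids    # in the universe, missing from the database
    commissions = database_ids - universe_ids  # in the database, not in the universe
    return {
        "omissions": omissions,
        "commissions": commissions,
        "omission_rate": len(omissions) / len(universe_ids),
        "commission_rate": len(commissions) / len(database_ids),
    }

def value_completeness(records, attributes):
    """Share of attribute slots that hold a value other than None."""
    total = len(records) * len(attributes)
    filled = sum(1 for r in records for a in attributes if r.get(a) is not None)
    return filled / total

report = feature_completeness(database_ids=[1, 2, 3, 5], universe_ids=[1, 2, 3, 4])
print(report["omissions"], report["commissions"])  # {4} {5}

records = [{"name": "A", "height": 10}, {"name": "B", "height": None}]
print(value_completeness(records, ["name", "height"]))  # 0.75
```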
Data Quality Standards
Data quality standards are typically used for data transfer and metadata.
Data quality documentation is key to the effective use of geospatial data.
Examples of established standards include:
Spatial Data Transfer Standard (SDTS).
Content Standards for Digital Geospatial Metadata developed by the Federal Geographic Data Committee (FGDC).
National Transfer Format (NTF).
Digital Geographic Information Exchange Standard (DIGEST).
International Hydrographic Organisation (IHO) standard.
Draft standard of the Comité Européen de Normalisation (CEN).
Limitations of data quality standards:
Do not necessarily lend themselves to specific software implementations.
Treat data quality as static and do not dynamically account for changes in quality during data transformations.
Difficulty in ascertaining fitness-for-use due to the overabundance of information.
Difficulty in updating data quality documentation automatically in a GIS environment.
Data quality standards do not fully provide assurances demanded by agencies seeking to limit liability risks.
Quality assurance/quality control (QA/QC) programs are based on standard operating procedures that allow specific data quality objectives to be realised.
Metadata Systems
Metadata systems document data quality components.
Most GIS packages perform some metadata documentation, such as recording the number of rows and columns of cells in each layer for raster systems, or the spatial coordinate system for vector systems.
Few commercial GIS packages offer comprehensive data quality documentation.
Software packages exist for documenting layers with metadata, automatically updating the lineage of layers, and propagating data quality components.
Intelligent systems intercept GIS commands and dynamically build graphical representations of the data processing flow and derived layers.
Metadata analysis includes assessment of processing complexity, analysis of data source adequacy, propagation of error, and identifying strategies for enhancing derived data quality.
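Automatic lineage updating can be illustrated with a minimal sketch in which each derived layer inherits the processing history of its sources. The Layer class, the derive function, and the layer names are hypothetical and do not correspond to any particular GIS package.

```python
from dataclasses import dataclass, field

@dataclass
class Layer:
    name: str
    lineage: list = field(default_factory=list)  # ordered processing steps

def derive(name, operation, *sources):
    """Create a derived layer whose lineage includes that of every source layer."""
    lineage = []
    for src in sources:
        for step in src.lineage:
            if step not in lineage:  # avoid repeating steps shared by sources
                lineage.append(step)
    source_names = ", ".join(s.name for s in sources)
    lineage.append(f"{operation}({source_names}) -> {name}")
    return Layer(name, lineage)

soils = Layer("soils", ["import soils"])
slope = Layer("slope", ["import slope"])
risk = derive("erosion_risk", "overlay", soils, slope)
for step in risk.lineage:
    print(step)
```

Keeping the lineage as an ordered list of steps is one simple design; a system that draws the processing flow graphically would more likely store it as a graph of layers and operations.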
Cartographic Bias
Geospatial databases reflect biases similar to those of analogue cartographic data because the models embedded in GIS are digital translations of analogue models.
The vector data model differentiates geographical phenomena according to dimensionality (points, lines, and areas).
Cartographic fidelity implies that maps selectively represent the real world, portraying entities in a generalised way based on map scale and purpose.
Geospatial technology is evolving beyond the limitations of paper maps, offering new modes of representation.
The raster model supports alternative models of geospatial data, such as the field-based model, probabilistic surfaces, and models based on fuzzy set theory.
Technology separates the storage and communication roles, allowing data to be collected in as raw a form as possible, with representations created to achieve specific communication objectives.
Data are often transformed and merged with other data in GIS, but there is no guarantee that the data are suitable for such applications.
'Use error' can occur when data are applied to purposes for which they are not suitable.
Limited advancements have been made in assessing fitness-for-use and preventing use error.
GIS, Society, and Data Quality
Debates exist regarding whether geospatial databases are faithful images of reality or rhetorical devices conveying specific messages.
Critics suggest that GIS has led to a focus on producing truthful and objective representations of reality, neglecting the embedding of social and institutional values in databases.
Producers of geospatial data are generally aware of the limitations of these data.
Database content depends on the external environment and the values of the society and institution within which the database was constructed.
Values are embedded at the modelling and representation stages.
Unlike broad social values, values derived from institutional mandate can be documented and communicated through metadata.
The 'specification' concept recognises that each database has a particular set of objectives and that embedded in these objectives is the formal expression of the values associated with institutional factors.
Knowledgeable map users have always been aware of data limitations.