An intro to GIS: C4 Database Management

Learning Outcomes

Explain the difference between databases and database management systems (DBMS).
Outline why databases are important in GIS.
Give examples of the types of databases used with GIS.
Explain how the relational database model works.
Describe how to set up a relational database.
Explain how databases are linked with GIS.
List important considerations for GIS databases.

Introduction

Data about our world are produced continuously from various sources like:
- Remote sensing satellites.
- Automatic environmental monitoring equipment.
- Automated business transactions.
- Individuals engaged in research and survey work.
A large proportion of these data are managed in databases in text, numeric, and multimedia formats.
Examples of databases in various disciplines:
- Archaeology: British and Irish Archaeological Bibliography.
- Architecture: American database containing images of buildings and cities.
- Business Management: OECD statistical database for economic outlook.
- Soil Science: Soil Geographical Database of Europe.
- Population Research: Web database of research projects.
Everyday activities produce data that automatically enter databases, for example:
- Purchasing ski equipment updates bank, store, supplier, and marketing databases.
- Postal codes can serve as spatial references to link data in GIS.
The amount of data generated daily is immense and can be transformed into valuable information.
Market research companies, environmental agencies, and mapping agencies sell information.
Managing and sharing data efficiently is crucial to realize its value, and databases are a method for doing so.
Databases allow listing data in an appropriate order and undertaking searches for specific information.
Linking databases to GIS enhances the value of data and provides useful information.
Large organizations have adopted GIS for applications like automated mapping and facilities management (AM/FM) to decision support applications.
Integrating GIS with existing internal databases is important, many of which are relational.
Business databases can be data sources for GIS or be required to integrate with GIS for decision support.
This chapter addresses database management for large-scale, corporate applications of GIS.
The chapter considers the conceptual and theoretical ideas behind databases and introduces the main types of databases, explaining their characteristics, advantages, and disadvantages.
The steps involved in establishing a relational database are discussed in detail, as the relational database is the most commonly used at present.
In GIS, there are two types of data to be managed: spatial data and attribute data.
An entity (point, line, or area) has both:
- Spatial data: ‘where things are’.
- Attribute data: ‘what things are’.
Example: A latitude and longitude reference gives the location of a city, accompanied by attribute data like city name, country, and population figures.
Conventional databases stored only non-spatial entities but all entities in GIS are spatial, and offer more than just a method of handling the attributes of spatial entities.
Databases help convert data into information with value, where:
- Data = raw facts without structure and context.
- Information = results from the analysis or organization of data.
A database can sort, re-order, summarize, and combine data to provide information, and can be used to sort temperature values, calculate maximum, minimum, and average values, and convert units such as Celsius to Fahrenheit.
GIS can map the locations of temperature monitoring points, adding further value to the original data.
Decision-makers using GIS need information, not data, so a database offers one method of providing that information by organizing data into a more meaningful and usable form.
A database is a set of structured data which can be computer-based or manual (organized filing cabinet, a dictionary, telephone directory or address book).

Why Choose A Database Approach?

Manual handling of spatial data presents difficulties.
Organizations and companies may produce and use similar data for different purposes, storing that data in different formats and separately.
Consider Happy Valley ski resort before computerized databases existed; ski schools, hotels, and travel companies all handled client data separately.
Data was stored in different formats--box files, index cards, or in an employee's head.
Each organization uses the data for different purposes: ski school for booking lessons, hotels for booking rooms, travel companies for arranging flights.
Data duplication occurred using traditional data management approaches, such as a visitor’s address being held three times, with errors during transcription, and different versions existing across the companies.
Difficulties would be encountered if the organizations attempted to share their data and may be incompatibilities in data formats.
Date (1986) summarizes the problems with the traditional approach:
- Redundancy (unnecessary repetition).
- High maintenance costs.
- Long learning times and difficulties moving between systems.
- Ad hoc enhancements and improvements.
- Data-sharing difficulties.
- Lack of security and standards.
- Lack of coherent corporate views of data management.
Traditional methods of storing data cannot represent relationships, so a database approach is necessary.

The Database Approach

A database is a collection of related data that can be shared by different users.
Data are organized to promote ease of access and prevent unnecessary duplication.
Paper-based databases have limitations, such as:
- Inability to search by first name or part of an address.
- Inability to extract lists based on specific criteria.
- Lack of security.
- Inability to adapt the database for other uses.
- Expensive and time-consuming updating.
A computer approach can overcome these problems.
Some of the most often cited advantages of computer-based databases are summarized in Box 4.1.
The electronic telephone directory enquiry service available in most countries provides an example of a computerized database.

Database Management Systems

The data in a computer database are managed and accessed through a database management system (DBMS).
Individual application programs will access the data in the database through the DBMS.
For example, to book a new client’s ski lessons, the booking clerk will use an application produced using capabilities offered by the DBMS.
The clerk will not need to interact with the database directly or understand how data are structured within the database.
A DBMS is a computer program to control the storage, retrieval, and modification of data (Dale and McLaughlin, 1988).
A DBMS will allow users to join, manipulate, or otherwise access the data in any number of database files (Stern and Stern, 1993).
A DBMS must allow the definition of data and their attributes and relationships, as well as providing security and an interface between the end users and their applications and the data itself.
The functions of a DBMS can be summarized as:
- File handling and file management (for creating, modifying or deleting the database structure).
- Adding, updating, and deleting records.
- The extraction of information from data (sorting, summarizing and querying data).
- Maintenance of data security and integrity (housekeeping, logs, backup).
- Application building.
The overall objective of a DBMS is to allow users to deal with data without needing to know how the data are physically stored and structured in the computer.
To achieve this, DBMS usually comprise software tools for structuring, relating, and querying data; tools for the design of data entry and report forms; and application generators for the creation of customized applications.
A DBMS manages data that are organized using a database data model, analogous to the way in which spatial data are organized in a GIS according to a spatial data model (for example, raster or vector).

Database Data Models

There are a number of different database data models, and amongst those that have been used for attribute data in GIS:
- Hierarchical
- Network
- Relational
- Object-relational
- Object-oriented data models
Of these the relational data model has become the most widely used.
The Relational Database Model
At present the relational database model dominates GIS.
Many GIS software packages link directly to commercial relational database packages, and others include their own custom-designed relational database software.
Some GIS use a relational database to handle spatial as well as attribute data.
The relational data model is based on concepts proposed by Codd (1970).
Data are organized in a series of two-dimensional tables, each of which contains records for one entity and the tables are linked by common data, or keys.
Queries are possible on individual tables or on groups of tables.
Each table in a relational database contains data for one entity, and the data are organized into rows and columns, with the columns containing the attributes of the entity.
Each of the columns has a distinctive name, and each of the entries in a single column must be drawn from the same domain (where a domain may be all integer values, or dates or text).
Other characteristics are listed by Reeve (1996):
- There can be only one entry per cell.
- Each row must be distinctive (so that keys that use unique row entries are possible – in GIS location is often the key).
- Null values are possible where data values are not known.
The terminology of relational databases can be confusing, since different software vendors have adopted different terms for the same thing.
Table 4.1 illustrates the relationship between relational database terminology and the traditional table, or simple computer file.
A useful shorthand way of describing a table is using its ‘intension’. For the table in Figure 4.5 this would be:
- HOTEL (Hotel ID, Name, Address, No. rooms,Standard).
- The data in a relational database are stored as a set of base tables with the characteristics described above.
Other tables are created as the database is queried and these represent virtual views.
The table structure is extremely flexible and allows a wide variety of queries on the data.
Queries are possible on one table at a time, or on more than one table by linking through key fields.
Queries generate further tables, but these new tables are not usually stored.
There are few restrictions on the types of query possible.
With many relational databases querying is facilitated by menu systems and icons, or ‘query by example’ systems.
Queries are built up of expressions based on relational algebra, using commands such as SELECT (to select a subset of rows), PROJECT (to select a subset of columns) or JOIN (to join tables based on key fields).
SQL (standard query language) has been developed to facilitate the querying of relational databases.
SQL advantages:
- Completeness
- Simplicity
- Pseudo English-language style
- Wide application
SQL has not really developed to handle geographical concepts such as ‘near to’, ‘far from’ or ‘connected to’.
The availability of SQL is one of the advantages of the relational database model.
Additionally, the model has a sound theoretical base in mathematics, and a simple logical data model that is easy to understand.
The relational database model is more flexible than either of the previously used hierarchical or network models, but the model will always produce some data redundancy and can be slow and difficult to implement.
There are also problems with the handling of complex objects such as those found in GIS (and CAD and knowledge-based applications) as there is a limited range of data types, and difficulties with the handling of time.
Seaborn (1995) considers that many of the limitations of relational databases in GIS stem from the fact that they were developed to handle simple business data, and not complex multi-dimensional spatial data.
Relational databases are predominantly used for the handling of attribute data in GIS. For example, Esri’s ArcGIS maintains an attribute table in relational database software, using a unique ID to link this to spatial data.

Creating A Database

Database design and implementation are guided by the relationships between the data to be stored in the database.
The database design process is concerned with expressing these relationships, then implementation with setting up a new structure for these relationships within the chosen database software.
The stages of database creation are summarized in Box 4.2.
The first stage in designing a database, data investigation, can result in a mass of unstructured information on information flows, relationships and possible entities.
A key part of database development is data modeling.
There is a wide range of techniques available to assist this data modeling, including entity relationship modeling, (Chen, 1976) and normalization.

Steps in Database Creation

Data investigation is the fact finding stage, considering the type, quantity and qualities of data to be included in the database and deciding the nature of entities and attributes.
Data modeling is the process of forming a conceptual model of data by examining the relationships between entities and the characteristics of entities and attributes. This stage, like the data investigation stage, can be carried out independently of the soft- ware to be used.
Database design is the creation of a practical design for the database, dependent on the database software being used, and its data model.This is the process of translating the logical design for the database (produced during the data model- ling stage) into a design for the chosen DBMS. Field names, types and structure are decided. In practice, the design will be a compromise to fit the database design model with the chosen DBMS.
Database implementation is the procedure of populating the database with attribute data, and this is always followed by monitoring and upkeep, includ- ing fine tuning, modification and updating.

Entity Relationship Modeling (EAM)

There are four stages to entity relationship modeling (or entity attribute modeling, EAM):
- The identification of entities.
- The identification of relationships between entities.
- The identification of attributes of entities.
- The derivation of tables from this.
In a database for Happy Valley we may wish ‘hotels’, ‘tour companies’, ‘ski schools’ and ‘visitors’ to be regarded as entities.
Each entity has distinctive characteristics and can usually be described by a noun. Its characteristics are the attributes (for example, a hotel will have a name, address, number of rooms and standard), and its domain is the set of possible values (for example, the standard may be budget, standard, business or luxury).
The relationships between the entities can be described using verbs. Thus, a hotel is located in a resort; a visitor stays at a hotel, and a ski school teaches visitors.
Three types of relationship are possible:
- One to one (one visitor stays at one hotel).
- One to many (one ski school teaches many visitors).
- Many to many (many tour companies use many hotels).
These relationship can be expressed by the symbols: ‘1:1’, ‘1:M’ and ‘M:N’, and by using simple diagrams.

Table Development and Attributes

The entity relationship model diagram is helpful in deciding what will be appropriate tables for a rela- tional database.
Where the relationship is 1:1, tables for each entity can be joined together or kept sepa- rate.
Where the relationship is 1:M, two tables are needed with a key field to allow a relational join.
Where the relationship is M:N, the tables should be separated.
Where repeating fields occur, tables need to be broken down further to avoid redundancy.
Once the tables have been decided upon, the attributes needed should be identified.
This is the process of developing intensions that was outlined earlier.

Happy Valley Example Intensions:

HOTEL (Hotel ID, Name, other attributes …)
TRAVELCO (TravelCo ID, TravelCo Name, other attributes …)
SKISCHOOL (SkiSchool ID, SkiSchool Name, other attributes …)
VISITOR (Visitor ID, Visitor Name, Hotel ID, TravelCo ID, SkiSchool ID, other attributes …)
LINK (TravelCo ID, SkiSchool ID)
The final result of entity relationship modeling is an EAM model diagram, a set of table definitions and details of attributes (names, size and domain).
The database can then be implemented. Another example of the use of entity relationship modeling to develop a GIS database is given by Healey (1991).

Linking Spatial and Attribute Data

The relationship between GIS and databases varies.
For a simple raster GIS, where one cell in a layer of data contains a single value that represents the attributes of that cell, a database is not necessary.
Here the attribute values are likely to be held in the same file as the data layer itself, but there are few ‘real’ GIS like this, and those which exist are designed for analysis, rather than attribute data handling.
An improvement on this approach is the ability to handle attribute values in a file separate from the raster image, and allows linking the GIS software with proprietary relational DBMS to upgrade the capabilities.
Most GIS, particularly vector-based systems, offer a hybrid approach (Batty, 1990; Maguire et al., 1990; Cassettari, 1993).
In this case spatial data are stored as part of the GIS data structure and attribute data are stored in a relational DBMS.
This approach allows integration of existing databases with graphics by the allocation of a unique identifier to each feature in the GIS.
Finally, an alternative approach is an extended GIS, where all aspects of the spatial and attribute data are in a single DBMS.
Seaborn (1995) considers these ‘all-relational’ GIS to have considerable potential, and cites examples of major organizations such as British Telecom, Electricité de France and New Zealand Lands which have adopted this approach.

GIS Database Applications

GIS databases cover a wide spectrum of applications, from those involving a single user with a PC and low-cost software working on small-scale research projects to huge corporate databases with data distributed over several sites, each with different computer systems and different users.
Worboys (1995) offers a classification of potential database applications:
- Single-user small databases
- Corporate databases
- Office information systems
- Engineering databases
- Bibliographic databases
- Scientific databases
- Image and multimedia databases
- Geographic databases
For a single user, working on a PC database, issues and design considerations will be very different from those for a large, multi-user corporate database.
For the single user, flexibility and ease of use may be important, whereas security, reliability (the probabil- ity of the system running at a given time), integrity, performance and concurrent access by different users, often using intranet or Internet systems, may be required by a large-scale, multi-user application.
Many GIS applications are now large-scale corporate systems – for instance, in the gas, electricity, water and telecommunications industries.
Large corporate GIS projects have special database demands.
In some organizations GIS may be operated by a specialist department which controls access and manages the data.
Increasingly, however, GIS are being integrated into overall information strategies, requiring the integration of GIS and general business data.
There are two options for such databases:
- Centralized database system
- Distributed data- base system

Centralized Database System

All the system components reside at a single computer or site.
In Happy Valley, a centralized system may be available at the town hall, controlled and updated by council employees, and accessible to other users via terminals distributed throughout the town.

Distributed Database Systems

Trend is away from central servers to databases located in different places but connected together by net- works.
The local user should believe that all the data they are using are located on their machine, even though some may be on the other side of the world.
Using the Internet is an example.
Elmasri and Navathe (1994) describe such a distributed database system as one in which a collection of data that belongs logically to the same system is physically spread over the sites of a computer network.
If the multinational company, SkiResorts Inc., were to buy the whole of Happy Valley (ski schools, hotels, tourist offices, etc.) it may decide to implement a distributed database for visitor monitoring and management.
In this case, individual databases held by the ski schools and hotels could still be maintained by the origina- tors but made accessible to other authorized users, including the multinational company’s headquar- ters, via a communications network.
Worboys (1995) considers that the distributed database approach is naturally suited to many GIS applications.

Advantages of a Distributed Database

The reliability of the system is good since, if one of the database sites goes down, others may still be running.
The approach also allows data sharing, whilst maintaining some measure of local control, and improved database performance may be expe- rienced.
In Happy Valley some of the problems of different users holding their own databases could be overcome by a distributed database approach.

Large corporate database applications are characterized by the following needs:

The need for concurrent access and multi-user update, ensuring that when several users are active in the same part of the database, and could inadvertently update the same data simultaneously, they do so in an orderly manner, with the result they collectively intended.
The need to manage long transactions which is composed of many smaller updates, taking place over a period of hours or days by one or several users.
The need for multiple views or different win- dows into the same database.

Specialist needs make large-scale GIS databases difficult to implement. McLaren (1990) outlined some of these as:

Different basic spatial units used for different data
Data copyright and data ownership issues, large data volumes, formulation of projects at depart- mental level and institutional structures

Sanderson (2005) identifies other practical problems, all with financial implications for corporate users:

Expensive commercial products.
High overheads.
The need for end user training.

A variety of considerations will combine to ensure a successful database application, generally considered to be:

Contemporaneous data
Data as detailed as necessary for potential applications
Positionally accurate and internally accurate data
Up-to-date and maintained data
Data that are accessible to users.

Databases should be usable by non-technical experts who do not need to know the principles of databases to add, edit, query or output data from the database.

General problems with a database approach are complexity, cost, inefficiencies in processing and rigidity:

Complexity of DBMS means that training is required to design and maintain the database and applications.
Costs include software, development, design phases, maintenance, and data storage.
Processing inefficiencies may be caused by changes in user requirements and if changes require restructuring of the database, this may be difficult to implement.
Data formats offered by DBMS are fixed and there are limited capabilities, in most DBMS, for the handling of long text strings, graphics and other types of data.

All of these general problems apply to GIS databases, and there is an additional set of problems associated with data back-up, recovery, auditing, security, data integrity and concurrent update (Batty, 1990).

Geographical data have three elements: space, theme and time.

Handling of:

Spatial data (Chapter 3).
Thematic attribute data (current chapter).
Limitations of a relational database is the inability to handle temporal data effectively.

Another area of current interest is the implementation of databases and GIS using the Internet

Without a population registration system in the United Kingdom, population censuses are very important sources of data for social science research and policy making.

Real-time databases are an interesting area of current development.

Developments In Databases

There are problems with the relational approach to the management of spatial data, where spatial data do not naturally fit into tabular structures, and the SQL query language does not have capabilities for spatial ideas and concepts.
MySQL is an example of a widely used, open-source relational database management system which has implemented spatial extensions
Karlsson (no date) outlines the extensions necessary to allow SQL-based relational databases such as My SQL to accommodate GIS objects and operations.

Attention has also been focused on the development of object-oriented (OO) approaches to database design and there are some examples of OO database principles in current GIS.

The object-oriented database approach offers the opportunity to move away from the geometry-centric’ data models that present the world as collections of points, lines and polygons. Fundamental aim of OO model is to allow data modeling that is closer to real-world things and events:

The method for achieving this is to group together all the data describing a real-world entity, together with any operations that are appropriate to the entity, into an object.
Key feature of an OO database is the power given to the user to specify both the structure of these objects and the operations that can be applied to them (Elmasri and Navathe, 1994).
So, each entity is modeled as an object: Object = state + behavior.
The ‘state’ of an object is the set of values of its attributes and the ‘behaviour’ represents the meth- ods of operating on it.
The identity of an object is unique and does not change during its lifetime.
Composite objects, which are made up of more than one other object, can be created, and an object can belong to a superclass of objects, and then will inherit all the properties of this superclass.

The OO approach is possibly more appropriate for geographical data than the relational model, since it allows the modeling of complex, real-world objects, does not distinguish between spatial and attribute data and is appropriate for graphics opera- tions.

Longley et al. (2010) list the three particular features of object data models that make them par- ticularly good for modeling geographic systems:

Encapsulation – the packaging together of the description of state and behaviour in each object.
Inheritance – the ability to re-use some or all of the characteristics of one object in another object.
Polymorphism – the specific implementation of operations like draw, create or delete for each object.
However, Goodchild (1990) points out that there are problems with the approach, since many geographi- cal ideas have implicit uncertainty and the spatial objects that we require our databases to model are often the products of interpretation or generaliza- tions.
Therefore, it can be difficult to represent the world as rigidly bounded objects.
Other disadvan- tages are that the methods are still under research and development.
For the user there are the addi- tional problems that, to date, there is no standard data model, no clear theoretical base for the OO model and no standard query language.