Data Types Overview

Introduction

A data type defines a collection of data objects and a set of predefined operations on those objects
A descriptor is the collection of the attributes of a variable
An object represents an instance of a user-defined (abstract data) type
One design issue for all data types: What operations are defined and how are they specified?

Primitive Data Types

Almost all programming languages provide a set of primitive data types
Primitive data types are those not defined in terms of other data types
Some primitive data types reflect hardware directly; others require little non-hardware support for implementation

Integer

Almost always an exact reflection of the hardware, mapping is trivial
There may be as many as eight different integer types in a language
Java example: signed integer sizes include byte, short, int, long

Floating Point

Models real numbers, but only approximately
Scientific languages support at least two floating-point types (eg, float and double; sometimes more)
Usually implemented largely like the hardware, but not always
IEEE Floating-Point Standard 754

Complex

Some languages support a complex type (C99, Fortran, Python)
Each value consists of two floats: the real part and the imaginary part
Literal form in Python: (7 + 3j) where 7 is real and 3 is imaginary

Decimal

Important for business applications (money), essential to COBOL
C# offers a decimal data type
Stores a fixed number of decimal digits in coded form (BCD)
Advantages: accuracy; Disadvantages: limited range, potential memory waste

Boolean

Simplest data type
Range of values: two elements, true and false
Often implemented as bits or bytes; readability advantage

Character

Stored as numeric codings
Common coding: ASCII
Alternatives: Unicode (UCS-2) (16-bit), originally used in Java; now supported by many languages
32-bit Unicode (UCS-4)

Character String Types

Values are sequences of characters
Design issues: is it primitive or a special kind of array? should length be static or dynamic?

Character String Operations

Typical operations: assignment and copying, comparison, catenation, substring reference, pattern matching

Character String Type in Certain Languages

C and C++: not primitive; use char arrays with a library for operations
SNOBOL4: primitive string language with elaborate pattern matching
Fortran and Python: primitive type with assignment and several operations
Java and C#, Ruby, Swift: primitive via the String class
Perl, JavaScript, Ruby, PHP: built-in pattern matching using regular expressions

Character String Length Options

Static: COBOL, Java's String class
Limited Dynamic Length: C and C++; length indicated by a terminating character instead of a stored length
Dynamic (no maximum): SNOBOL4, Perl, JavaScript

Character String Type Evaluation

Aid to writability
If static length, inexpensive to provide; dynamic length adds expense but offers flexibility

Character String Implementation

Static length: compile-time descriptor
Limited dynamic length: run-time descriptor may be needed for length (not in C/C++)
Dynamic length: run-time descriptor required; allocation/deallocation is a major implementation challenge

Compile- and Run-Time Descriptors

Static strings use a compile-time descriptor
Limited dynamic strings may need run-time descriptor for length

User-Defined Ordinal Types

An ordinal type is one whose range of values can be easily associated with positive integers
Examples of primitive ordinal types in Java: integer, char, boolean

Enumeration Types

All possible values are named constants provided in the definition
Example: enum days {mon, tue, wed, thu, fri, sat, sun};
Design issues:
- Is an enumeration constant allowed to appear in more than one type definition, and how is the type of an occurrence checked?
- Are enumeration values coerced to integer?
- Can any other type be coerced to an enumeration type?

Evaluation of Enumerated Type

Aids readability, e.g., no need to code a color as a number
Aids reliability, compiler checks operations to prevent invalid values (e.g., cannot add colors)
Some languages provide better support where enumeration type variables are not coerced into integers (eg, C#, F#, Swift, Java 5.0) compared to C++

Array Types

An array is a homogeneous aggregate of data elements identified by position relative to the first element

Array Design Issues

What types are legal for subscripts?
Are subscripting expressions in element references range-checked?
When are subscript ranges bound?
When does allocation take place?
Are ragged or rectangular multidimensional arrays allowed, or both?
What is the maximum number of subscripts?
Can array objects be initialized?
Are any kind of slices supported?

Array Indexing

Indexing maps indices to elements: arrayname(indexvalue_list) -> element
Syntax examples:
- Fortran and Ada use parentheses
- Ada explicitly uses parentheses to show uniformity between array references and function calls
- Other languages use brackets

Arrays Index (Subscript) Types

FORTRAN, C: integer only
Java: integer types only
Range checking: C, C++, Perl, and Fortran do not require range checking; Java, ML, C# specify range checking

Subscript Binding and Array Categories

Static: subscript ranges are statically bound and storage allocation is static (before run-time)
- Advantage: efficiency (no dynamic allocation)
Fixed stack-dynamic: subscript ranges are statically bound, allocation occurs at declaration time
- Advantage: space efficiency
Fixed heap-dynamic: storage binding is dynamic but fixed after allocation (binding when requested; storage from heap)
Heap-dynamic: binding of subscript ranges and storage allocation is dynamic and can change
- Advantage: flexibility (arrays can grow or shrink during program execution)

C/C++ Arrays and Other Languages

C and C++ arrays with static modifier are static
C and C++ without static modifier are fixed stack-dynamic
C/C++ provide fixed heap-dynamic arrays
C# includes an ArrayList class that provides fixed heap-dynamic
Perl, JavaScript, Python, and Ruby support various dynamic array types

Array Initialization

Some languages allow initialization at storage allocation time
Examples:
- C/C++/Java/Swift/C#/arrays: int list[] = {4, 5, 7, 83};
- Character strings in C/C++: char name[] = "freddie";
- Arrays of strings in C/C++: char *names[] = {"Bob", "Jake", "Joe"};
- Java initialization of String objects: String[] names = {"Bob", "Jake", "Joe"};

Heterogeneous Arrays

A heterogeneous array is an array whose elements need not be of the same type
Supported by Perl, Python, JavaScript, and Ruby

Arrays Operations

APL provides powerful array processing operations for vectors and matrices, including unary operators (eg, reverse)
Python supports array assignments as well as array concatenation and element membership operations
Ruby also supports array concatenation

Rectangular and Jagged Arrays

Rectangular array: all rows have the same number of elements and all columns have the same number of elements
Jagged matrix: rows have varying numbers of elements; appears as arrays of arrays
C, C++, and Java support jagged arrays; F# and C# support both rectangular arrays

Slices

A slice is a substructure of an array; a referencing mechanism only
Useful in languages with array operations

Slice Examples

Python: vector = [2, 4, 6, 8, 10, 12, 14, 16]
- mat = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
- vector(3:6) yields a three-element array
- mat[0][0:2] yields first two elements of the first row
Ruby: list.slice(2, 2) returns the third and fourth elements of list

Implementation of Arrays

Access function maps subscript expressions to an address in the array
For single-dimension arrays: address(list[k]) = address(list[lowerbound]) + ((k - lowerbound) * element_size)

Accessing Multi-dimensioned Arrays

Two common layouts:
- Row major order (by rows) – used in most languages
- Column major order (by columns) – used in Fortran
A compile-time descriptor for a multidimensional array describes its shape

Locating an Element in a Multi-dimensional Array

General formula: Location(a[I,j]) = address of a[rowlb, collb] + (((I - rowlb) * n) + (j - collb)) * element_size

Compile-Time Descriptors

For both single- and multi-dimensional arrays the descriptor includes:
- Array
- Element type
- Index type
- Index lower bound
- Index upper bound
- Address
For multidimensional arrays the descriptor includes multiple index ranges

Associative Arrays

An associative array is an unordered collection of data elements indexed by keys
Keys must be stored; size can be static or dynamic
Built-in type in Perl, Python, Ruby, and similar languages

Associative Arrays in Perl

Named with %; literals in parentheses; example: hi_temps = ("Mon" => 77, "Tue" => 79, "Wed" => 65, …)
Subscript using braces and keys: hi_temps{"Wed"} = 83
Elements can be removed with delete: delete hi_temps{"Tue"}

Record Types

A record is a possibly heterogeneous aggregate of data elements identified by names
Design issues:
- What is the syntactic form of references to the field?
- Are elliptical references allowed?

Definition of Records in COBOL

Example COBOL style using level numbers to show nesting:
01 EMP-REC.
02 EMP-NAME.
05 FIRST PIC X(20).
05 MID PIC X(10).
05 LAST PIC X(20).
02 HOURLY-RATE PIC 99V99.

References to Records

Record field references:
1. COBOL style: fieldname OF recordname1 OF … OF recordname_n
2. Other languages: recordname1.recordname2. … .field_name
Fully qualified references must include all record names
Elliptical references allow leaving out record names when unambiguous (eg, in COBOL FIRST, FIRST OF EMP-NAME, FIRST of EMP-REC)

Evaluation and Comparison to Arrays

Records are used when data values are heterogeneous
Access to array elements is generally slower than access to record fields because subscripts are dynamic whereas field names are static
Dynamic subscripts could be used with record field access, but would weaken type checking and slow performance

Implementation of Record Type

Offset address relative to the beginning of the record is associated with each field

Tuple Types

A tuple is similar to a record but elements are not named
Used in Python, ML, and F# to allow functions to return multiple values (Python example: myTuple = (3, 5.8, 'apple'))
Referenced with subscripts (begin at 1)

List Types

Lists in Lisp and Scheme are delimited by parentheses and use no commas, eg (A B C D) and (A (B C) D)
Data and code share the same form: as data, (A B C) is the list itself; as code, (A B C) is the function A applied to parameters B and C
The interpreter uses quotation to distinguish lists as data: ' (quote) or an apostrophe

List Types (continued)

Scheme: CAR returns the first element; CDR returns the remainder; CONS combines an element with a list; LIST creates a new list from parameters
ML: lists written in brackets with elements separated by commas; elements must be of the same type
The Scheme CAR and CDR are called hd and tl in ML
F# and ML: pattern matching and type-safe lists
Python: lists are mutable and can contain mixed types; lists serve as arrays; subscripting starts at 0; list operations include append, delete, and comprehensions
List comprehensions in Python and Scheme share the idea of building lists from existing sequences
Haskell and ML produce list comprehensions; C# and Java provide List and ArrayList generic collections

Unions Types

A union is a type whose variables can store different type values at different times during execution
Design issue: should type checking be required?

Discriminated vs Free Unions

C and C++ provide unions with no language level type checking (free unions)
Type checking for unions requires a discriminant
Discerned by ML, Haskell, and F#

Unions in F

Defined with a type statement using OR
Example: type intReal = | IntValue of int | RealValue of float;;
IntValue and RealValue are constructors
To create values: let ir1 = IntValue 17;; let ir2 = RealValue 3.4;;

Unions in F# (continued)

Accessing union values uses pattern matching: match value with | IntValue v -> … | RealValue v -> …
Example of display function using pattern matching

Evaluation of Unions

Free unions are unsafe due to lack of type checking
Java and C# do not support unions in the same way; ML, F# support discriminated unions; safety concerns

Pointer and Reference Types

A pointer type variable can hold memory addresses and a special value nil
Provide indirect addressing and a way to manage dynamic memory
A pointer can access storage in the heap

Design Issues of Pointers

Scope and lifetime of a pointer variable
Lifetime of a heap-dynamic variable
Are pointers restricted in what they can point to?
Are pointers used for dynamic storage management, indirect addressing, or both?
Should the language support pointer types, reference types, or both?

Pointer Operations

Two fundamental operations: assignment and dereferencing
Assignment sets a pointer to a useful address
Dereferencing yields the value stored at the location
Dereferencing can be explicit or implicit
Example: in C++, *j = *ptr assigns j to the value located at ptr

Pointer Assignment Illustrated

(diagram from slides showing pointer, heap, and dynamic variable interactions)

Problems with Pointers

Dangling pointers: pointer points to a heap-dynamic variable that has been deallocated
Lost heap-dynamic variable (garbage): allocated variable no longer accessible
Memory leakage occurs when a heap-dynamic variable is no longer reachable but not deallocated

Pointers in C and C++

Extremely flexible but require careful use
Pointers can point to any variable regardless of allocation timing
Used for dynamic storage management and addressing
Pointer arithmetic is possible; explicit dereferencing and address-of operators
Void* can point to any type and can be type checked but cannot be dereferenced

Pointer Arithmetic in C and C++

Example: float stuff[100]; float *p; p = stuff; *(p+5) is equivalent to stuff[5]; p[i] also equivalent

Reference Types

C++ includes a special kind of pointer called a reference type used for parameters
Advantages of both pass-by-reference and pass-by-value
Java uses references to objects and avoids raw pointers
Java references replace pointers; C# combines aspects of both

Evaluation of Pointers

Dangling pointers and objects are problems, as is heap management
Pointers widen the set of accessible cells; necessary for dynamic data structures

Representations of Pointers

Large computers use single values; Intel uses segment and offset

Dangling Pointer Problem

Tombstone technique: extra heap cell holding a pointer to the heap-dynamic variable
When deallocated, tombstone remains but is set to nil (costly in time and space)
Locks-and-keys: pointer values are (key, address) pairs
Heap-dynamic variables are represented as (variable, lock) pairs; lock value updated on allocate/free

Heap Management

Run-time management is very complex
Single-size cells vs variable-size cells
Two approaches to reclaim garbage:
- Reference counters (eager): reclamation happens gradually
- Mark-sweep (lazy): reclaim when space becomes exhausted

Reference Counter

Maintain a count of how many pointers point to a cell
Disadvantages: extra space, per-operation overhead, circular references issues
Advantage: incremental reclamation avoids long waits

Mark-Sweep

Allocation occurs; marking phase identifies reachable cells
All heap cells have a mark bit; initially all are considered garbage
Traverse pointers to mark reachable cells as not garbage
Then reclaim all unmarked cells
Disadvantages: historical implementations caused long pauses; modern variants mitigate this

Variable-Size Cells

More complex heap management problems
If mark-sweep is used, initial indicators and marking become harder; managing free space becomes more complex

Optional Types

Useful when a variable may have no value
Languages with optional types include C#, F#, and Swift
In C#, references are already optional via null; value types can be optional using a question mark in declarations, e.g., int? x;
The no-value is null; testing for null is used

Type Checking

Generalizes operands and operators to include subprograms and assignments
Type checking ensures operands of an operator are compatible
A compatible type is legal for the operator or can be implicitly converted via coercion
A type error occurs when an operator is applied to an inappropriate type

Static vs Dynamic Type Checking

If all type bindings are static, most type checking can be static
If type bindings are dynamic, type checking must be dynamic

Strong Typing

A programming language is strongly typed if type errors are always detected
Examples: C and C++ are not strongly typed due to unsafe coercions; Java and C# are almost strong; ML and F# are strongly typed
Coercion rules affect strong typing and can weaken it (C++ vs ML/F#)
Java's assignment coercions are fewer than C++, but Java's strong typing is still weaker than Ada in some respects

Name Type Equivalence

Two variables have equivalent types if they are in the same declaration or share the same type name
Easy to implement but restrictive (subtypes, subranges, and actual vs formal parameter type matching are limited)

Structure Type Equivalence

Two variables have equivalent types if their structures are identical
More flexible but harder to implement

Type Equivalence (continued)

Questions about structural equivalence across records, arrays with different bounds, or enums with differently spelled components
Structural equivalence cannot distinguish types that have identical structures but different intended meanings

Theory and Data Types

Type theory is a broad area in math, logic, CS, and philosophy
Two branches in computer science:
- Practical: data types in commercial languages
- Abstract: typed lambda calculus
A type system is a set of types and a collection of rules that define how types interact
Formal models include attribute grammars or type maps, finite mappings modeling arrays, functions, tuples, records, unions, subtypes, etc.

Summary

Data types of a language largely determine its style and usefulness
Primitive data types include numeric, character, and Boolean types
User-defined types such as enumerations and subrange types improve readability and reliability
Arrays and records are common constructs
Pointers enable addressing flexibility and dynamic storage management
Optional types expand expressiveness for missing values
Type checking and typing discipline (static/dynamic, strong typing) shape safety and usability
Type equivalence concepts (name vs structure) influence language design decisions
Type theory connects practical language design with foundational principles