Data Types Overview
Introduction
A data type defines a collection of data objects and a set of predefined operations on those objects
A descriptor is the collection of the attributes of a variable
An object represents an instance of a user-defined (abstract data) type
One design issue for all data types: What operations are defined and how are they specified?
Primitive Data Types
Almost all programming languages provide a set of primitive data types
Primitive data types are those not defined in terms of other data types
Some primitive data types reflect hardware directly; others require little non-hardware support for implementation
Integer
Almost always an exact reflection of the hardware, mapping is trivial
There may be as many as eight different integer types in a language
Java example: signed integer sizes include byte, short, int, long
Floating Point
Models real numbers, but only approximately
Scientific languages support at least two floating-point types (eg, float and double; sometimes more)
Usually implemented largely like the hardware, but not always
IEEE Floating-Point Standard 754
Complex
Some languages support a complex type (C99, Fortran, Python)
Each value consists of two floats: the real part and the imaginary part
Literal form in Python: (7 + 3j) where 7 is real and 3 is imaginary
Decimal
Important for business applications (money), essential to COBOL
C# offers a decimal data type
Stores a fixed number of decimal digits in coded form (BCD)
Advantages: accuracy; Disadvantages: limited range, potential memory waste
Boolean
Simplest data type
Range of values: two elements, true and false
Often implemented as bits or bytes; readability advantage
Character
Stored as numeric codings
Common coding: ASCII
Alternatives: Unicode (UCS-2) (16-bit), originally used in Java; now supported by many languages
32-bit Unicode (UCS-4)
Character String Types
Values are sequences of characters
Design issues: is it primitive or a special kind of array? should length be static or dynamic?
Character String Operations
Typical operations: assignment and copying, comparison, catenation, substring reference, pattern matching
Character String Type in Certain Languages
C and C++: not primitive; use char arrays with a library for operations
SNOBOL4: primitive string language with elaborate pattern matching
Fortran and Python: primitive type with assignment and several operations
Java and C#, Ruby, Swift: primitive via the String class
Perl, JavaScript, Ruby, PHP: built-in pattern matching using regular expressions
Character String Length Options
Static: COBOL, Java's String class
Limited Dynamic Length: C and C++; length indicated by a terminating character instead of a stored length
Dynamic (no maximum): SNOBOL4, Perl, JavaScript
Character String Type Evaluation
Aid to writability
If static length, inexpensive to provide; dynamic length adds expense but offers flexibility
Character String Implementation
Static length: compile-time descriptor
Limited dynamic length: run-time descriptor may be needed for length (not in C/C++)
Dynamic length: run-time descriptor required; allocation/deallocation is a major implementation challenge
Compile- and Run-Time Descriptors
Static strings use a compile-time descriptor
Limited dynamic strings may need run-time descriptor for length
User-Defined Ordinal Types
An ordinal type is one whose range of values can be easily associated with positive integers
Examples of primitive ordinal types in Java: integer, char, boolean
Enumeration Types
All possible values are named constants provided in the definition
Example: enum days {mon, tue, wed, thu, fri, sat, sun};
Design issues:
Is an enumeration constant allowed to appear in more than one type definition, and how is the type of an occurrence checked?
Are enumeration values coerced to integer?
Can any other type be coerced to an enumeration type?
Evaluation of Enumerated Type
Aids readability, e.g., no need to code a color as a number
Aids reliability, compiler checks operations to prevent invalid values (e.g., cannot add colors)
Some languages provide better support where enumeration type variables are not coerced into integers (eg, C#, F#, Swift, Java 5.0) compared to C++
Array Types
An array is a homogeneous aggregate of data elements identified by position relative to the first element
Array Design Issues
What types are legal for subscripts?
Are subscripting expressions in element references range-checked?
When are subscript ranges bound?
When does allocation take place?
Are ragged or rectangular multidimensional arrays allowed, or both?
What is the maximum number of subscripts?
Can array objects be initialized?
Are any kind of slices supported?
Array Indexing
Indexing maps indices to elements: arrayname(indexvalue_list) -> element
Syntax examples:
Fortran and Ada use parentheses
Ada explicitly uses parentheses to show uniformity between array references and function calls
Other languages use brackets
Arrays Index (Subscript) Types
FORTRAN, C: integer only
Java: integer types only
Range checking: C, C++, Perl, and Fortran do not require range checking; Java, ML, C# specify range checking
Subscript Binding and Array Categories
Static: subscript ranges are statically bound and storage allocation is static (before run-time)
Advantage: efficiency (no dynamic allocation)
Fixed stack-dynamic: subscript ranges are statically bound, allocation occurs at declaration time
Advantage: space efficiency
Fixed heap-dynamic: storage binding is dynamic but fixed after allocation (binding when requested; storage from heap)
Heap-dynamic: binding of subscript ranges and storage allocation is dynamic and can change
Advantage: flexibility (arrays can grow or shrink during program execution)
C/C++ Arrays and Other Languages
C and C++ arrays with static modifier are static
C and C++ without static modifier are fixed stack-dynamic
C/C++ provide fixed heap-dynamic arrays
C# includes an ArrayList class that provides fixed heap-dynamic
Perl, JavaScript, Python, and Ruby support various dynamic array types
Array Initialization
Some languages allow initialization at storage allocation time
Examples:
C/C++/Java/Swift/C#/arrays: int list[] = {4, 5, 7, 83};
Character strings in C/C++: char name[] = "freddie";
Arrays of strings in C/C++: char *names[] = {"Bob", "Jake", "Joe"};
Java initialization of String objects: String[] names = {"Bob", "Jake", "Joe"};
Heterogeneous Arrays
A heterogeneous array is an array whose elements need not be of the same type
Supported by Perl, Python, JavaScript, and Ruby
Arrays Operations
APL provides powerful array processing operations for vectors and matrices, including unary operators (eg, reverse)
Python supports array assignments as well as array concatenation and element membership operations
Ruby also supports array concatenation
Rectangular and Jagged Arrays
Rectangular array: all rows have the same number of elements and all columns have the same number of elements
Jagged matrix: rows have varying numbers of elements; appears as arrays of arrays
C, C++, and Java support jagged arrays; F# and C# support both rectangular arrays
Slices
A slice is a substructure of an array; a referencing mechanism only
Useful in languages with array operations
Slice Examples
Python: vector = [2, 4, 6, 8, 10, 12, 14, 16]
mat = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
vector(3:6) yields a three-element array
mat[0][0:2] yields first two elements of the first row
Ruby: list.slice(2, 2) returns the third and fourth elements of list
Implementation of Arrays
Access function maps subscript expressions to an address in the array
For single-dimension arrays: address(list[k]) = address(list[lowerbound]) + ((k - lowerbound) * element_size)
Accessing Multi-dimensioned Arrays
Two common layouts:
Row major order (by rows) – used in most languages
Column major order (by columns) – used in Fortran
A compile-time descriptor for a multidimensional array describes its shape
Locating an Element in a Multi-dimensional Array
General formula: Location(a[I,j]) = address of a[rowlb, collb] + (((I - rowlb) * n) + (j - collb)) * element_size
Compile-Time Descriptors
For both single- and multi-dimensional arrays the descriptor includes:
Array
Element type
Index type
Index lower bound
Index upper bound
Address
For multidimensional arrays the descriptor includes multiple index ranges
Associative Arrays
An associative array is an unordered collection of data elements indexed by keys
Keys must be stored; size can be static or dynamic
Built-in type in Perl, Python, Ruby, and similar languages
Associative Arrays in Perl
Named with %; literals in parentheses; example: hi_temps = ("Mon" => 77, "Tue" => 79, "Wed" => 65, …)
Subscript using braces and keys: hi_temps{"Wed"} = 83
Elements can be removed with delete: delete hi_temps{"Tue"}
Record Types
A record is a possibly heterogeneous aggregate of data elements identified by names
Design issues:
What is the syntactic form of references to the field?
Are elliptical references allowed?
Definition of Records in COBOL
Example COBOL style using level numbers to show nesting:
01 EMP-REC.
02 EMP-NAME.
05 FIRST PIC X(20).
05 MID PIC X(10).
05 LAST PIC X(20).
02 HOURLY-RATE PIC 99V99.
References to Records
Record field references:
COBOL style: fieldname OF recordname1 OF … OF recordname_n
Other languages: recordname1.recordname2. … .field_name
Fully qualified references must include all record names
Elliptical references allow leaving out record names when unambiguous (eg, in COBOL FIRST, FIRST OF EMP-NAME, FIRST of EMP-REC)
Evaluation and Comparison to Arrays
Records are used when data values are heterogeneous
Access to array elements is generally slower than access to record fields because subscripts are dynamic whereas field names are static
Dynamic subscripts could be used with record field access, but would weaken type checking and slow performance
Implementation of Record Type
Offset address relative to the beginning of the record is associated with each field
Tuple Types
A tuple is similar to a record but elements are not named
Used in Python, ML, and F# to allow functions to return multiple values (Python example: myTuple = (3, 5.8, 'apple'))
Referenced with subscripts (begin at 1)
List Types
Lists in Lisp and Scheme are delimited by parentheses and use no commas, eg (A B C D) and (A (B C) D)
Data and code share the same form: as data, (A B C) is the list itself; as code, (A B C) is the function A applied to parameters B and C
The interpreter uses quotation to distinguish lists as data: ' (quote) or an apostrophe
List Types (continued)
Scheme: CAR returns the first element; CDR returns the remainder; CONS combines an element with a list; LIST creates a new list from parameters
ML: lists written in brackets with elements separated by commas; elements must be of the same type
The Scheme CAR and CDR are called hd and tl in ML
F# and ML: pattern matching and type-safe lists
Python: lists are mutable and can contain mixed types; lists serve as arrays; subscripting starts at 0; list operations include append, delete, and comprehensions
List comprehensions in Python and Scheme share the idea of building lists from existing sequences
Haskell and ML produce list comprehensions; C# and Java provide List and ArrayList generic collections
Unions Types
A union is a type whose variables can store different type values at different times during execution
Design issue: should type checking be required?
Discriminated vs Free Unions
C and C++ provide unions with no language level type checking (free unions)
Type checking for unions requires a discriminant
Discerned by ML, Haskell, and F#
Unions in F
Defined with a type statement using OR
Example: type intReal = | IntValue of int | RealValue of float;;
IntValue and RealValue are constructors
To create values: let ir1 = IntValue 17;; let ir2 = RealValue 3.4;;
Unions in F# (continued)
Accessing union values uses pattern matching: match value with | IntValue v -> … | RealValue v -> …
Example of display function using pattern matching
Evaluation of Unions
Free unions are unsafe due to lack of type checking
Java and C# do not support unions in the same way; ML, F# support discriminated unions; safety concerns
Pointer and Reference Types
A pointer type variable can hold memory addresses and a special value nil
Provide indirect addressing and a way to manage dynamic memory
A pointer can access storage in the heap
Design Issues of Pointers
Scope and lifetime of a pointer variable
Lifetime of a heap-dynamic variable
Are pointers restricted in what they can point to?
Are pointers used for dynamic storage management, indirect addressing, or both?
Should the language support pointer types, reference types, or both?
Pointer Operations
Two fundamental operations: assignment and dereferencing
Assignment sets a pointer to a useful address
Dereferencing yields the value stored at the location
Dereferencing can be explicit or implicit
Example: in C++, *j = *ptr assigns j to the value located at ptr
Pointer Assignment Illustrated
(diagram from slides showing pointer, heap, and dynamic variable interactions)
Problems with Pointers
Dangling pointers: pointer points to a heap-dynamic variable that has been deallocated
Lost heap-dynamic variable (garbage): allocated variable no longer accessible
Memory leakage occurs when a heap-dynamic variable is no longer reachable but not deallocated
Pointers in C and C++
Extremely flexible but require careful use
Pointers can point to any variable regardless of allocation timing
Used for dynamic storage management and addressing
Pointer arithmetic is possible; explicit dereferencing and address-of operators
Void* can point to any type and can be type checked but cannot be dereferenced
Pointer Arithmetic in C and C++
Example: float stuff[100]; float *p; p = stuff; *(p+5) is equivalent to stuff[5]; p[i] also equivalent
Reference Types
C++ includes a special kind of pointer called a reference type used for parameters
Advantages of both pass-by-reference and pass-by-value
Java uses references to objects and avoids raw pointers
Java references replace pointers; C# combines aspects of both
Evaluation of Pointers
Dangling pointers and objects are problems, as is heap management
Pointers widen the set of accessible cells; necessary for dynamic data structures
Representations of Pointers
Large computers use single values; Intel uses segment and offset
Dangling Pointer Problem
Tombstone technique: extra heap cell holding a pointer to the heap-dynamic variable
When deallocated, tombstone remains but is set to nil (costly in time and space)
Locks-and-keys: pointer values are (key, address) pairs
Heap-dynamic variables are represented as (variable, lock) pairs; lock value updated on allocate/free
Heap Management
Run-time management is very complex
Single-size cells vs variable-size cells
Two approaches to reclaim garbage:
Reference counters (eager): reclamation happens gradually
Mark-sweep (lazy): reclaim when space becomes exhausted
Reference Counter
Maintain a count of how many pointers point to a cell
Disadvantages: extra space, per-operation overhead, circular references issues
Advantage: incremental reclamation avoids long waits
Mark-Sweep
Allocation occurs; marking phase identifies reachable cells
All heap cells have a mark bit; initially all are considered garbage
Traverse pointers to mark reachable cells as not garbage
Then reclaim all unmarked cells
Disadvantages: historical implementations caused long pauses; modern variants mitigate this
Variable-Size Cells
More complex heap management problems
If mark-sweep is used, initial indicators and marking become harder; managing free space becomes more complex
Optional Types
Useful when a variable may have no value
Languages with optional types include C#, F#, and Swift
In C#, references are already optional via null; value types can be optional using a question mark in declarations, e.g., int? x;
The no-value is null; testing for null is used
Type Checking
Generalizes operands and operators to include subprograms and assignments
Type checking ensures operands of an operator are compatible
A compatible type is legal for the operator or can be implicitly converted via coercion
A type error occurs when an operator is applied to an inappropriate type
Static vs Dynamic Type Checking
If all type bindings are static, most type checking can be static
If type bindings are dynamic, type checking must be dynamic
Strong Typing
A programming language is strongly typed if type errors are always detected
Examples: C and C++ are not strongly typed due to unsafe coercions; Java and C# are almost strong; ML and F# are strongly typed
Coercion rules affect strong typing and can weaken it (C++ vs ML/F#)
Java's assignment coercions are fewer than C++, but Java's strong typing is still weaker than Ada in some respects
Name Type Equivalence
Two variables have equivalent types if they are in the same declaration or share the same type name
Easy to implement but restrictive (subtypes, subranges, and actual vs formal parameter type matching are limited)
Structure Type Equivalence
Two variables have equivalent types if their structures are identical
More flexible but harder to implement
Type Equivalence (continued)
Questions about structural equivalence across records, arrays with different bounds, or enums with differently spelled components
Structural equivalence cannot distinguish types that have identical structures but different intended meanings
Theory and Data Types
Type theory is a broad area in math, logic, CS, and philosophy
Two branches in computer science:
Practical: data types in commercial languages
Abstract: typed lambda calculus
A type system is a set of types and a collection of rules that define how types interact
Formal models include attribute grammars or type maps, finite mappings modeling arrays, functions, tuples, records, unions, subtypes, etc.
Summary
Data types of a language largely determine its style and usefulness
Primitive data types include numeric, character, and Boolean types
User-defined types such as enumerations and subrange types improve readability and reliability
Arrays and records are common constructs
Pointers enable addressing flexibility and dynamic storage management
Optional types expand expressiveness for missing values
Type checking and typing discipline (static/dynamic, strong typing) shape safety and usability
Type equivalence concepts (name vs structure) influence language design decisions
Type theory connects practical language design with foundational principles