Data Types Overview

Introduction

  • A data type defines a collection of data objects and a set of predefined operations on those objects

  • A descriptor is the collection of the attributes of a variable

  • An object represents an instance of a user-defined (abstract data) type

  • One design issue for all data types: What operations are defined and how are they specified?

Primitive Data Types

  • Almost all programming languages provide a set of primitive data types

  • Primitive data types are those not defined in terms of other data types

  • Some primitive data types reflect hardware directly; others require little non-hardware support for implementation

Integer

  • Almost always an exact reflection of the hardware, mapping is trivial

  • There may be as many as eight different integer types in a language

  • Java example: signed integer sizes include byte, short, int, long

Floating Point

  • Models real numbers, but only approximately

  • Scientific languages support at least two floating-point types (eg, float and double; sometimes more)

  • Usually implemented largely like the hardware, but not always

  • IEEE Floating-Point Standard 754

Complex

  • Some languages support a complex type (C99, Fortran, Python)

  • Each value consists of two floats: the real part and the imaginary part

  • Literal form in Python: (7 + 3j) where 7 is real and 3 is imaginary

Decimal

  • Important for business applications (money), essential to COBOL

  • C# offers a decimal data type

  • Stores a fixed number of decimal digits in coded form (BCD)

  • Advantages: accuracy; Disadvantages: limited range, potential memory waste

Boolean

  • Simplest data type

  • Range of values: two elements, true and false

  • Often implemented as bits or bytes; readability advantage

Character

  • Stored as numeric codings

  • Common coding: ASCII

  • Alternatives: Unicode (UCS-2) (16-bit), originally used in Java; now supported by many languages

  • 32-bit Unicode (UCS-4)

Character String Types

  • Values are sequences of characters

  • Design issues: is it primitive or a special kind of array? should length be static or dynamic?

Character String Operations

  • Typical operations: assignment and copying, comparison, catenation, substring reference, pattern matching

Character String Type in Certain Languages

  • C and C++: not primitive; use char arrays with a library for operations

  • SNOBOL4: primitive string language with elaborate pattern matching

  • Fortran and Python: primitive type with assignment and several operations

  • Java and C#, Ruby, Swift: primitive via the String class

  • Perl, JavaScript, Ruby, PHP: built-in pattern matching using regular expressions

Character String Length Options

  • Static: COBOL, Java's String class

  • Limited Dynamic Length: C and C++; length indicated by a terminating character instead of a stored length

  • Dynamic (no maximum): SNOBOL4, Perl, JavaScript

Character String Type Evaluation

  • Aid to writability

  • If static length, inexpensive to provide; dynamic length adds expense but offers flexibility

Character String Implementation

  • Static length: compile-time descriptor

  • Limited dynamic length: run-time descriptor may be needed for length (not in C/C++)

  • Dynamic length: run-time descriptor required; allocation/deallocation is a major implementation challenge

Compile- and Run-Time Descriptors

  • Static strings use a compile-time descriptor

  • Limited dynamic strings may need run-time descriptor for length

User-Defined Ordinal Types

  • An ordinal type is one whose range of values can be easily associated with positive integers

  • Examples of primitive ordinal types in Java: integer, char, boolean

Enumeration Types

  • All possible values are named constants provided in the definition

  • Example: enum days {mon, tue, wed, thu, fri, sat, sun};

  • Design issues:

    • Is an enumeration constant allowed to appear in more than one type definition, and how is the type of an occurrence checked?

    • Are enumeration values coerced to integer?

    • Can any other type be coerced to an enumeration type?

Evaluation of Enumerated Type

  • Aids readability, e.g., no need to code a color as a number

  • Aids reliability, compiler checks operations to prevent invalid values (e.g., cannot add colors)

  • Some languages provide better support where enumeration type variables are not coerced into integers (eg, C#, F#, Swift, Java 5.0) compared to C++

Array Types

  • An array is a homogeneous aggregate of data elements identified by position relative to the first element

Array Design Issues

  • What types are legal for subscripts?

  • Are subscripting expressions in element references range-checked?

  • When are subscript ranges bound?

  • When does allocation take place?

  • Are ragged or rectangular multidimensional arrays allowed, or both?

  • What is the maximum number of subscripts?

  • Can array objects be initialized?

  • Are any kind of slices supported?

Array Indexing

  • Indexing maps indices to elements: arrayname(indexvalue_list) -> element

  • Syntax examples:

    • Fortran and Ada use parentheses

    • Ada explicitly uses parentheses to show uniformity between array references and function calls

    • Other languages use brackets

Arrays Index (Subscript) Types

  • FORTRAN, C: integer only

  • Java: integer types only

  • Range checking: C, C++, Perl, and Fortran do not require range checking; Java, ML, C# specify range checking

Subscript Binding and Array Categories

  • Static: subscript ranges are statically bound and storage allocation is static (before run-time)

    • Advantage: efficiency (no dynamic allocation)

  • Fixed stack-dynamic: subscript ranges are statically bound, allocation occurs at declaration time

    • Advantage: space efficiency

  • Fixed heap-dynamic: storage binding is dynamic but fixed after allocation (binding when requested; storage from heap)

  • Heap-dynamic: binding of subscript ranges and storage allocation is dynamic and can change

    • Advantage: flexibility (arrays can grow or shrink during program execution)

C/C++ Arrays and Other Languages

  • C and C++ arrays with static modifier are static

  • C and C++ without static modifier are fixed stack-dynamic

  • C/C++ provide fixed heap-dynamic arrays

  • C# includes an ArrayList class that provides fixed heap-dynamic

  • Perl, JavaScript, Python, and Ruby support various dynamic array types

Array Initialization

  • Some languages allow initialization at storage allocation time

  • Examples:

    • C/C++/Java/Swift/C#/arrays: int list[] = {4, 5, 7, 83};

    • Character strings in C/C++: char name[] = "freddie";

    • Arrays of strings in C/C++: char *names[] = {"Bob", "Jake", "Joe"};

    • Java initialization of String objects: String[] names = {"Bob", "Jake", "Joe"};

Heterogeneous Arrays

  • A heterogeneous array is an array whose elements need not be of the same type

  • Supported by Perl, Python, JavaScript, and Ruby

Arrays Operations

  • APL provides powerful array processing operations for vectors and matrices, including unary operators (eg, reverse)

  • Python supports array assignments as well as array concatenation and element membership operations

  • Ruby also supports array concatenation

Rectangular and Jagged Arrays

  • Rectangular array: all rows have the same number of elements and all columns have the same number of elements

  • Jagged matrix: rows have varying numbers of elements; appears as arrays of arrays

  • C, C++, and Java support jagged arrays; F# and C# support both rectangular arrays

Slices

  • A slice is a substructure of an array; a referencing mechanism only

  • Useful in languages with array operations

Slice Examples

  • Python: vector = [2, 4, 6, 8, 10, 12, 14, 16]

    • mat = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

    • vector(3:6) yields a three-element array

    • mat[0][0:2] yields first two elements of the first row

  • Ruby: list.slice(2, 2) returns the third and fourth elements of list

Implementation of Arrays

  • Access function maps subscript expressions to an address in the array

  • For single-dimension arrays: address(list[k]) = address(list[lowerbound]) + ((k - lowerbound) * element_size)

Accessing Multi-dimensioned Arrays

  • Two common layouts:

    • Row major order (by rows) – used in most languages

    • Column major order (by columns) – used in Fortran

  • A compile-time descriptor for a multidimensional array describes its shape

Locating an Element in a Multi-dimensional Array

  • General formula: Location(a[I,j]) = address of a[rowlb, collb] + (((I - rowlb) * n) + (j - collb)) * element_size

Compile-Time Descriptors

  • For both single- and multi-dimensional arrays the descriptor includes:

    • Array

    • Element type

    • Index type

    • Index lower bound

    • Index upper bound

    • Address

  • For multidimensional arrays the descriptor includes multiple index ranges

Associative Arrays

  • An associative array is an unordered collection of data elements indexed by keys

  • Keys must be stored; size can be static or dynamic

  • Built-in type in Perl, Python, Ruby, and similar languages

Associative Arrays in Perl

  • Named with %; literals in parentheses; example: hi_temps = ("Mon" => 77, "Tue" => 79, "Wed" => 65, …)

  • Subscript using braces and keys: hi_temps{"Wed"} = 83

  • Elements can be removed with delete: delete hi_temps{"Tue"}

Record Types

  • A record is a possibly heterogeneous aggregate of data elements identified by names

  • Design issues:

    • What is the syntactic form of references to the field?

    • Are elliptical references allowed?

Definition of Records in COBOL

  • Example COBOL style using level numbers to show nesting:
    01 EMP-REC.
    02 EMP-NAME.
    05 FIRST PIC X(20).
    05 MID PIC X(10).
    05 LAST PIC X(20).
    02 HOURLY-RATE PIC 99V99.

References to Records

  • Record field references:

    1. COBOL style: fieldname OF recordname1 OF … OF recordname_n

    2. Other languages: recordname1.recordname2. … .field_name

  • Fully qualified references must include all record names

  • Elliptical references allow leaving out record names when unambiguous (eg, in COBOL FIRST, FIRST OF EMP-NAME, FIRST of EMP-REC)

Evaluation and Comparison to Arrays

  • Records are used when data values are heterogeneous

  • Access to array elements is generally slower than access to record fields because subscripts are dynamic whereas field names are static

  • Dynamic subscripts could be used with record field access, but would weaken type checking and slow performance

Implementation of Record Type

  • Offset address relative to the beginning of the record is associated with each field

Tuple Types

  • A tuple is similar to a record but elements are not named

  • Used in Python, ML, and F# to allow functions to return multiple values (Python example: myTuple = (3, 5.8, 'apple'))

  • Referenced with subscripts (begin at 1)

List Types

  • Lists in Lisp and Scheme are delimited by parentheses and use no commas, eg (A B C D) and (A (B C) D)

  • Data and code share the same form: as data, (A B C) is the list itself; as code, (A B C) is the function A applied to parameters B and C

  • The interpreter uses quotation to distinguish lists as data: ' (quote) or an apostrophe

List Types (continued)

  • Scheme: CAR returns the first element; CDR returns the remainder; CONS combines an element with a list; LIST creates a new list from parameters

  • ML: lists written in brackets with elements separated by commas; elements must be of the same type

  • The Scheme CAR and CDR are called hd and tl in ML

  • F# and ML: pattern matching and type-safe lists

  • Python: lists are mutable and can contain mixed types; lists serve as arrays; subscripting starts at 0; list operations include append, delete, and comprehensions

  • List comprehensions in Python and Scheme share the idea of building lists from existing sequences

  • Haskell and ML produce list comprehensions; C# and Java provide List and ArrayList generic collections

Unions Types

  • A union is a type whose variables can store different type values at different times during execution

  • Design issue: should type checking be required?

Discriminated vs Free Unions

  • C and C++ provide unions with no language level type checking (free unions)

  • Type checking for unions requires a discriminant

  • Discerned by ML, Haskell, and F#

Unions in F

  • Defined with a type statement using OR

  • Example: type intReal = | IntValue of int | RealValue of float;;

  • IntValue and RealValue are constructors

  • To create values: let ir1 = IntValue 17;; let ir2 = RealValue 3.4;;

Unions in F# (continued)

  • Accessing union values uses pattern matching: match value with | IntValue v -> … | RealValue v -> …

  • Example of display function using pattern matching

Evaluation of Unions

  • Free unions are unsafe due to lack of type checking

  • Java and C# do not support unions in the same way; ML, F# support discriminated unions; safety concerns

Pointer and Reference Types

  • A pointer type variable can hold memory addresses and a special value nil

  • Provide indirect addressing and a way to manage dynamic memory

  • A pointer can access storage in the heap

Design Issues of Pointers

  • Scope and lifetime of a pointer variable

  • Lifetime of a heap-dynamic variable

  • Are pointers restricted in what they can point to?

  • Are pointers used for dynamic storage management, indirect addressing, or both?

  • Should the language support pointer types, reference types, or both?

Pointer Operations

  • Two fundamental operations: assignment and dereferencing

  • Assignment sets a pointer to a useful address

  • Dereferencing yields the value stored at the location

  • Dereferencing can be explicit or implicit

  • Example: in C++, *j = *ptr assigns j to the value located at ptr

Pointer Assignment Illustrated

  • (diagram from slides showing pointer, heap, and dynamic variable interactions)

Problems with Pointers

  • Dangling pointers: pointer points to a heap-dynamic variable that has been deallocated

  • Lost heap-dynamic variable (garbage): allocated variable no longer accessible

  • Memory leakage occurs when a heap-dynamic variable is no longer reachable but not deallocated

Pointers in C and C++

  • Extremely flexible but require careful use

  • Pointers can point to any variable regardless of allocation timing

  • Used for dynamic storage management and addressing

  • Pointer arithmetic is possible; explicit dereferencing and address-of operators

  • Void* can point to any type and can be type checked but cannot be dereferenced

Pointer Arithmetic in C and C++

  • Example: float stuff[100]; float *p; p = stuff; *(p+5) is equivalent to stuff[5]; p[i] also equivalent

Reference Types

  • C++ includes a special kind of pointer called a reference type used for parameters

  • Advantages of both pass-by-reference and pass-by-value

  • Java uses references to objects and avoids raw pointers

  • Java references replace pointers; C# combines aspects of both

Evaluation of Pointers

  • Dangling pointers and objects are problems, as is heap management

  • Pointers widen the set of accessible cells; necessary for dynamic data structures

Representations of Pointers

  • Large computers use single values; Intel uses segment and offset

Dangling Pointer Problem

  • Tombstone technique: extra heap cell holding a pointer to the heap-dynamic variable

  • When deallocated, tombstone remains but is set to nil (costly in time and space)

  • Locks-and-keys: pointer values are (key, address) pairs

  • Heap-dynamic variables are represented as (variable, lock) pairs; lock value updated on allocate/free

Heap Management

  • Run-time management is very complex

  • Single-size cells vs variable-size cells

  • Two approaches to reclaim garbage:

    • Reference counters (eager): reclamation happens gradually

    • Mark-sweep (lazy): reclaim when space becomes exhausted

Reference Counter

  • Maintain a count of how many pointers point to a cell

  • Disadvantages: extra space, per-operation overhead, circular references issues

  • Advantage: incremental reclamation avoids long waits

Mark-Sweep

  • Allocation occurs; marking phase identifies reachable cells

  • All heap cells have a mark bit; initially all are considered garbage

  • Traverse pointers to mark reachable cells as not garbage

  • Then reclaim all unmarked cells

  • Disadvantages: historical implementations caused long pauses; modern variants mitigate this

Variable-Size Cells

  • More complex heap management problems

  • If mark-sweep is used, initial indicators and marking become harder; managing free space becomes more complex

Optional Types

  • Useful when a variable may have no value

  • Languages with optional types include C#, F#, and Swift

  • In C#, references are already optional via null; value types can be optional using a question mark in declarations, e.g., int? x;

  • The no-value is null; testing for null is used

Type Checking

  • Generalizes operands and operators to include subprograms and assignments

  • Type checking ensures operands of an operator are compatible

  • A compatible type is legal for the operator or can be implicitly converted via coercion

  • A type error occurs when an operator is applied to an inappropriate type

Static vs Dynamic Type Checking

  • If all type bindings are static, most type checking can be static

  • If type bindings are dynamic, type checking must be dynamic

Strong Typing

  • A programming language is strongly typed if type errors are always detected

  • Examples: C and C++ are not strongly typed due to unsafe coercions; Java and C# are almost strong; ML and F# are strongly typed

  • Coercion rules affect strong typing and can weaken it (C++ vs ML/F#)

  • Java's assignment coercions are fewer than C++, but Java's strong typing is still weaker than Ada in some respects

Name Type Equivalence

  • Two variables have equivalent types if they are in the same declaration or share the same type name

  • Easy to implement but restrictive (subtypes, subranges, and actual vs formal parameter type matching are limited)

Structure Type Equivalence

  • Two variables have equivalent types if their structures are identical

  • More flexible but harder to implement

Type Equivalence (continued)

  • Questions about structural equivalence across records, arrays with different bounds, or enums with differently spelled components

  • Structural equivalence cannot distinguish types that have identical structures but different intended meanings

Theory and Data Types

  • Type theory is a broad area in math, logic, CS, and philosophy

  • Two branches in computer science:

    • Practical: data types in commercial languages

    • Abstract: typed lambda calculus

  • A type system is a set of types and a collection of rules that define how types interact

  • Formal models include attribute grammars or type maps, finite mappings modeling arrays, functions, tuples, records, unions, subtypes, etc.

Summary

  • Data types of a language largely determine its style and usefulness

  • Primitive data types include numeric, character, and Boolean types

  • User-defined types such as enumerations and subrange types improve readability and reliability

  • Arrays and records are common constructs

  • Pointers enable addressing flexibility and dynamic storage management

  • Optional types expand expressiveness for missing values

  • Type checking and typing discipline (static/dynamic, strong typing) shape safety and usability

  • Type equivalence concepts (name vs structure) influence language design decisions

  • Type theory connects practical language design with foundational principles