Data Fundamentals Notes
Numerical arrays are a powerful and efficient way to represent data, especially in the context of images, sounds, videos, scientific data, and 3D graphics. These arrays enable efficient processing and manipulation of large datasets, which is crucial for modern computing tasks such as machine learning and data analysis.
Key Concepts
ndarray: Short for 'n-dimensional array,' this data structure is the core component of the NumPy library, enabling high-performance numeric computing in Python. Every element of a given ndarray shares one data type (its dtype), chosen from a range of numeric and other types. This uniform layout is optimized for computation, allowing complex operations on large datasets without significant performance drawbacks.
Vectorization: A technique that allows operations to be performed on whole arrays of values, rather than element by element. This is crucial for speeding up computations as it minimizes the need for explicit loops, leveraging low-level optimizations in libraries like NumPy.
GPUs: Graphics Processing Units are specifically designed to handle computations involving large numerical arrays, significantly outperforming Central Processing Units (CPUs) in parallel processing tasks. They facilitate faster computations for tasks requiring extensive data manipulation.
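To make the vectorization entry above concrete, here is a minimal sketch that computes the same result with an explicit Python loop and with a single whole-array expression. The array `values` and the helper `scale_with_loop` are hypothetical names introduced only for this illustration.

```python
import numpy as np

# Hypothetical sample data for illustration
values = np.arange(5, dtype=np.float64)  # [0., 1., 2., 3., 4.]

def scale_with_loop(arr, factor):
    """Element-by-element version: one Python-level step per value."""
    out = np.empty_like(arr)
    for i in range(arr.shape[0]):
        out[i] = arr[i] * factor
    return out

looped = scale_with_loop(values, 2.0)

# Vectorized version: one expression applied to the whole array
vectorized = values * 2.0  # [0., 2., 4., 6., 8.]
```

Both versions give the same values. On large arrays the vectorized form is typically much faster, because the loop runs inside NumPy's compiled code rather than in the Python interpreter.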
Array Operations
Essential operations on arrays include creating, indexing, slicing, joining, and rotating. These operations form the backbone of array manipulation:
Arithmetic: Basic operations like addition (+) can be applied element-wise to arrays.
import numpy as np
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
z = x + y  # Element-wise addition
Broadcasting: A powerful feature that extends the shape of arrays to allow for element-wise operations between arrays of different shapes by automatically expanding their dimensions.
a = np.array([1, 2, 3])
b = np.array([[10], [20], [30]])
result = a + b  # Broadcasting example
Aggregation: Functions like summation and mean can consolidate data, providing insights quickly and accurately.
sum_value = np.sum(x)  # Sum of elements
mean_value = np.mean(x)  # Mean of elements
Sorting and Selection: Techniques such as argmax and argsort are used to locate and sort values in an array, facilitating effective data analysis.
indices = np.argsort(x)  # Indices to sort x
sorted_array = x[indices]  # Sorted array
Array Types
Vector: A 1D array that can represent a list of values. Example: [1, 2, 3]. Vectors are fundamental for linear algebra applications.
Matrix: A 2D array that consists of rows and columns, used in various applications including linear transformations. Example: [[1, 2], [3, 4]]. Matrices follow the rules of matrix multiplication, which are essential for many scientific computations.
matrix = np.array([[1, 2], [3, 4]])
product = np.dot(matrix, matrix) # Matrix multiplication
Tensor: An n-dimensional array, capable of representing more complex data structures like a stack of matrices or multi-channel images. Tensors are foundational in neural networks and deep learning.
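The two tensor examples mentioned above, a stack of matrices and a multi-channel image, can be sketched as follows. The names `first`, `second`, and `image`, and the 4x6 image size, are hypothetical choices made for illustration.

```python
import numpy as np

# Two 2x2 matrices stacked along a new leading axis form a 3D array
first = np.array([[1, 2], [3, 4]])
second = np.array([[5, 6], [7, 8]])
stacked = np.stack([first, second])  # shape (2, 2, 2)

# A hypothetical 4x6 RGB image: height x width x 3 color channels
image = np.zeros((4, 6, 3), dtype=np.uint8)

# ndim gives the number of axes; shape gives the size along each axis
stack_dims = stacked.ndim   # 3
image_shape = image.shape   # (4, 6, 3)
```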
Array Properties
Shape: The shape of an array refers to its dimensions. For instance, a 2D array might have a shape of (rows, columns) which defines its structure and data organization.
shape_of_array = matrix.shape # Returns (2, 2)
dtype: The data type that specifies the kind of data contained in the array, such as float64, which indicates double precision floating point, or int32, a 32-bit integer. Choosing the appropriate dtype is crucial for memory efficiency and performance.
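As a small illustration of why dtype matters for memory, the sketch below stores the same three values as float64 and as int32 and compares their sizes. The variable names are hypothetical.

```python
import numpy as np

# The same three values stored with different dtypes
as_float64 = np.array([1, 2, 3], dtype=np.float64)
as_int32 = np.array([1, 2, 3], dtype=np.int32)

# itemsize is bytes per element; nbytes is the total for the array
float_bytes = (as_float64.itemsize, as_float64.nbytes)  # (8, 24)
int_bytes = (as_int32.itemsize, as_int32.nbytes)        # (4, 12)

# astype returns a converted copy; converting to an integer type
# drops the fractional part
truncated = np.array([1.7, 2.2]).astype(np.int32)       # [1, 2]
```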
Array Creation
Various NumPy functions facilitate the creation of arrays:
np.array(): Converts sequences (like lists) into NumPy arrays, enabling high-performance mathematical operations.
array_from_list = np.array([1, 2, 3])
np.zeros(), np.ones(), np.full(): Functions for creating arrays filled with specific values, helpful for tasks such as initializing weights in machine learning.
zeros_array = np.zeros((2, 3))  # 2x3 array of zeros
np.empty(): Allocates memory for an array without initializing it, allowing for faster array creation when values will be filled in later.
empty_array = np.empty((2, 2))  # 2x2 array with arbitrary, uninitialized values
np.arange(): Generates an array of evenly spaced values within a specified range using a fixed step; the stop value is excluded.
range_array = np.arange(start=0, stop=10, step=2)  # [0, 2, 4, 6, 8]
np.linspace(): Similar to arange(), but takes the number of points rather than a step and includes the endpoint by default, which is convenient for plotting functions and scientific computations.
linspace_array = np.linspace(start=0, stop=1, num=5) # [0. , 0.25, 0.5, 0.75, 1. ]
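The list above mentions np.ones() and np.full() without showing them, and the difference between arange() and linspace() is easiest to see side by side. A minimal sketch, with hypothetical variable names:

```python
import numpy as np

ones_array = np.ones((2, 3))     # 2x3 array of 1.0
full_array = np.full((2, 2), 7)  # 2x2 array where every element is 7

# arange takes a step and excludes the stop value
by_step = np.arange(0, 1, 0.25)  # [0., 0.25, 0.5, 0.75]

# linspace takes a count and includes the stop value by default
by_count = np.linspace(0, 1, 5)  # [0., 0.25, 0.5, 0.75, 1.]
```

When the exact number of points matters, linspace() is usually the safer choice, since with non-integer steps the length of an arange() result can be affected by floating-point rounding.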
Array Indexing and Slicing
Indexing in NumPy starts from 0, meaning the first element is accessed with index 0.
Slicing uses the syntax array[start:stop:step], where the stop index is excluded, to extract subarrays efficiently.
sub_array = x[1:3] # Elements from index 1 to 2
Negative Indices: Allow counting from the end of the array, providing flexible access to elements.
last_element = x[-1] # Last element of the array
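The examples above use only start and stop on a 1D array. The following sketch extends them to the step argument and to 2D arrays. The arrays `data`, `grid`, and `original` are hypothetical, and `reshape` is used only to build a small 2D example.

```python
import numpy as np

data = np.arange(10)        # [0, 1, 2, ..., 9]

every_other = data[::2]     # [0, 2, 4, 6, 8]
reversed_data = data[::-1]  # [9, 8, ..., 0]
last_three = data[-3:]      # [7, 8, 9]

# In 2D, one slice per axis, separated by a comma: grid[rows, columns]
grid = np.arange(12).reshape(3, 4)
first_row = grid[0, :]      # [0, 1, 2, 3]
second_column = grid[:, 1]  # [1, 5, 9]

# Basic slices are views: writing through one changes the original array
original = np.array([1, 2, 3, 4])
view = original[:2]
view[0] = 99                # original is now [99, 2, 3, 4]
```

If an independent copy is needed, call `.copy()` on the slice before modifying it.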
Array Manipulation
Transposition: The attribute array.T exchanges rows with columns, changing the array's shape but not its data, which is crucial in various mathematical and statistical algorithms.
transposed_matrix = matrix.T # Transpose of the matrix
Flipping & Rotating: These transformations change the order of elements using slicing and indexing, permitting flexible data manipulation for specific requirements.
flipped_array = np.flip(x)  # Flipping the array
Joining: The functions np.concatenate() and np.stack() enable combining multiple arrays into a single array structure, important for aggregating data in preprocessing stages.
joined_array = np.concatenate((x, y)) # Join x and y
Tiling: np.tile() allows for repeating arrays, which can be useful in simulations and tests.
tiled_array = np.tile(x, 2) # Repeat x twice
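Two points from this section benefit from a side-by-side sketch: how np.stack() differs from np.concatenate(), and how flipping can be written with slicing. The call to `np.rot90` is an addition not named in the notes; it is one NumPy function for rotation, shown here as an assumption about what "rotating" refers to.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

# concatenate joins along an existing axis; stack creates a new one
joined = np.concatenate((x, y))  # shape (6,)
stacked = np.stack((x, y))       # shape (2, 3)

matrix = np.array([[1, 2], [3, 4]])

# Flipping with slicing: a step of -1 reverses the chosen axis
flipped_rows = matrix[::-1, :]   # [[3, 4], [1, 2]]
flipped_cols = matrix[:, ::-1]   # [[2, 1], [4, 3]]

# np.rot90 rotates by 90 degrees counterclockwise by default
rotated = np.rot90(matrix)       # [[2, 4], [1, 3]]
```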
Boolean Arrays and Masking
Comparison operations yield Boolean arrays (e.g., x > 5), allowing for easy filtering of data.
boolean_mask = x > 1 # Creates a Boolean array based on the condition
filtered_data = x[boolean_mask] # Applying the mask to filter elements
The function np.where(condition, a, b) builds a new array by taking elements from a where the condition is true and from b elsewhere.
replaced_array = np.where(x > 2, x, 0)  # Keep values greater than 2, replace the rest with 0
Fancy Indexing: Enables advanced indexing techniques, where arrays can be indexed with another array of indices, providing powerful selection capabilities.
indices = [0, 2]
fancy_indexed_array = x[indices] # Access elements at indices 0 and 2
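Masks become more useful when conditions are combined. The sketch below shows that, along with counting matches and assigning through a mask. The array `readings` is hypothetical sample data.

```python
import numpy as np

readings = np.array([3, 8, 1, 9, 5])  # hypothetical data

# Combine conditions with & (and), | (or), ~ (not);
# each comparison needs its own parentheses
in_range = (readings > 2) & (readings < 9)  # [True, True, False, False, True]
selected = readings[in_range]               # [3, 8, 5]

# True counts as 1, so summing a mask counts the matches
match_count = np.sum(in_range)              # 3

# A mask can also be used on the left-hand side to assign in place
capped = readings.copy()
capped[capped > 5] = 5                      # [3, 5, 1, 5, 5]
```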
Arithmetic and Broadcasting
NumPy supports element-wise arithmetic across arrays, which is computationally efficient.
Broadcasting Rules:
When arrays have different numbers of dimensions, the smaller array is padded with ones (1s) on the left.
If the sizes along a dimension differ and one of them is 1, the array with size 1 is repeated (stretched) along that dimension so that the shapes match. If the sizes differ and neither is 1, the arrays are incompatible and the operation raises an error.
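The rules above can be traced step by step on the shapes (3,) and (3, 1) used earlier in these notes. This is a minimal sketch; `combined_shape` and `incompatible` are hypothetical names, and `np.broadcast` is used only to report the resulting shape.

```python
import numpy as np

a = np.array([1, 2, 3])           # shape (3,)
b = np.array([[10], [20], [30]])  # shape (3, 1)

# Rule 1: a has fewer dimensions, so its shape is padded on the left:
#         (3,) is treated as (1, 3)
# Rule 2: size-1 dimensions are stretched, so (1, 3) and (3, 1)
#         both behave as (3, 3)
result = a + b
# result:
# [[11, 12, 13],
#  [21, 22, 23],
#  [31, 32, 33]]

# np.broadcast reports the combined shape without doing the arithmetic
combined_shape = np.broadcast(a, b).shape  # (3, 3)

# Shapes (3,) and (4,) differ and neither size is 1, so they do not broadcast
try:
    np.array([1, 2, 3]) + np.array([1, 2, 3, 4])
    incompatible = False
except ValueError:
    incompatible = True
```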
Reduction and Accumulation
Reduction: Involves applying an operator across sequences to condense data (e.g., np.sum(), which calculates the total, and np.max(), which finds the maximum value).
reduced_sum = np.sum(x) # Total of all elements in x
Accumulation: Keeps track of intermediate values during computation using functions such as np.cumsum(), which calculates cumulative sums, and np.gradient(), which estimates the change between neighboring values.
cumulative_sum = np.cumsum(x) # Cumulative sum of x
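On arrays with more than one dimension, reductions can also be applied along a single axis. The sketch below shows that, together with the two accumulation functions named above. The array `table` and the sample values are hypothetical.

```python
import numpy as np

table = np.array([[1, 2, 3],
                  [4, 5, 6]])

total = np.sum(table)                # 21, reduces over every element
column_sums = np.sum(table, axis=0)  # [5, 7, 9], collapses the rows
row_maxima = np.max(table, axis=1)   # [3, 6], collapses the columns

# Accumulation keeps every intermediate result, so the length is preserved
running_total = np.cumsum(np.array([1, 2, 3, 4]))    # [1, 3, 6, 10]

# np.gradient uses central differences in the interior
# and one-sided differences at the two ends
slope = np.gradient(np.array([0.0, 1.0, 4.0, 9.0]))  # [1., 2., 4., 5.]
```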
Finding
Functions such as np.argmax() and np.argmin() are essential to locate the indices of maximum or minimum elements, aiding in analytical tasks.
max_index = np.argmax(x) # Index of maximum element
The np.argsort() function provides the indices that would sort an array, useful for organizing data efficiently.
sorted_indices = np.argsort(x) # Indices for sorting x
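Because the running example x = [1, 2, 3] is already sorted, these functions are easier to follow on unsorted data. A minimal sketch, where `scores` is a hypothetical array:

```python
import numpy as np

scores = np.array([40, 10, 30, 20])  # hypothetical unsorted data

max_index = np.argmax(scores)  # 0, since scores[0] == 40
min_index = np.argmin(scores)  # 1, since scores[1] == 10

order = np.argsort(scores)     # [1, 3, 2, 0]
ascending = scores[order]      # [10, 20, 30, 40]

# Reverse the order and keep two indices to get the two largest values
top_two = scores[order[::-1][:2]]  # [40, 30]
```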
The function np.nonzero() returns the indices of non-zero elements as a tuple of index arrays, one per dimension, helping in data filtration processes.
```python
nonzero_indices = np.nonzero(x)  # Tuple of index arrays for the non-zero elements
```