what is a computer: an electronic device that stores and processes digital information by following a programmed set of instructions
what are the 5 components of a computer system: CPU, memory, control unit, input unit, output unit
What is the CPU (Central processing unit): circuitry that carries out the instructions of a computer program by performing arithmetic, logic, control, and I/O operations
What is the speed of the CPU controlled by: the system clock
what does the system clock do: generates electronic pulses at regular intervals to coordinate CPU activities, with the interval long enough that even the slowest operation can finish
how is the performance of a CPU measured: in clock speed (in GHz) and FLOPS (floating point operations per second)
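what does FLOPS tell us: how many floating point computations the CPU can complete per second
what does clock speed tell us: how many clock cycles the CPU runs per second (an instruction takes one or more cycles, so this bounds how many instructions are performed per second)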
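note (sketch): the clock rate fixes how long one clock cycle lasts. 1 GHz is 10^9 cycles per second, so one cycle at f GHz takes 1/f nanoseconds. The Python below is a minimal illustration only; the function names and the 4-cycle operation are assumptions made up for the example, not figures from these notes.

```python
def cycle_time_ns(clock_ghz: float) -> float:
    """Length of one clock cycle in nanoseconds for a clock rate given in GHz."""
    # 1 GHz = 1e9 cycles per second, so one cycle lasts 1 / clock_ghz nanoseconds.
    return 1.0 / clock_ghz


def operation_time_ns(cycles: int, clock_ghz: float) -> float:
    """Time for an operation that needs `cycles` clock cycles (hypothetical count)."""
    return cycles * cycle_time_ns(clock_ghz)


if __name__ == "__main__":
    print(cycle_time_ns(4.0))         # 0.25 ns per cycle at 4.0 GHz
    print(operation_time_ns(4, 4.0))  # 1.0 ns for an assumed 4-cycle operation
```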
note: FLOPS have increased with time, but clock speed has saturated and leveled out over time
What is the thermal brick wall: clock rates have reached an upper limit because pushing them higher needs more cooling power than is practical
why do we need more cooling power: higher CPU speed requires a higher clock rate, which means faster switching and higher current, which produces more heat and a lower signal-to-noise ratio
what is the current thermal brick wall at: hard to get > 4.0 GHz
What are memory modules: any physical device capable of storing information for immediate use
note: memory is not directly controlled by the CPU
note: it requires persistent power to operate
What is parallel computing: computation where many calculations are carried out simultaneously by breaking a big problem into smaller ones and solving them concurrently
what is computational gain: serial time / parallel time
what is parallel efficiency: computational gain / number of processors
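note (sketch): the two ratios above can be computed directly. The Python below is a minimal illustration; the function names and the timings (100 s serial, 25 s on 8 processors) are hypothetical values chosen for the example.

```python
def computational_gain(serial_time: float, parallel_time: float) -> float:
    """Computational gain = serial time / parallel time."""
    return serial_time / parallel_time


def parallel_efficiency(serial_time: float, parallel_time: float,
                        n_processors: int) -> float:
    """Parallel efficiency = computational gain / number of processors."""
    return computational_gain(serial_time, parallel_time) / n_processors


if __name__ == "__main__":
    # Hypothetical timings: 100 s on one processor, 25 s on 8 processors.
    print(computational_gain(100.0, 25.0))      # 4.0
    print(parallel_efficiency(100.0, 25.0, 8))  # 0.5
```

An efficiency of 1.0 would mean every added processor contributes its full share; here half of the added capacity is lost to overhead.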
what is serial computing: a single processor running the computer program
what is shared memory parallelism (OpenMP): multiple processors or threads work on different parts of the program but share memory; they sometimes compete for resources, which slows down the process
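note (sketch): the idea of threads sharing memory and competing for a resource can be illustrated with Python's standard `threading` module. This is not OpenMP, and because of Python's global interpreter lock it is not expected to speed up CPU-bound work; it only shows shared state and contention. The function name `parallel_sum` and the chunking scheme are assumptions for the example.

```python
import threading


def parallel_sum(values, n_threads=4):
    """Sum `values` with several threads that all update one shared total."""
    total = [0]              # shared memory: every thread sees this same list
    lock = threading.Lock()  # the shared resource the threads compete for

    def worker(chunk):
        partial = sum(chunk)  # independent work on this thread's part
        with lock:            # only one thread may update the total at a time
            total[0] += partial

    # Give each thread a different part of the data (every n_threads-th item).
    chunks = [values[i::n_threads] for i in range(n_threads)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


if __name__ == "__main__":
    print(parallel_sum(list(range(1, 101))))  # 5050
```

Time spent waiting for the lock is an example of the slowdown caused by competing for shared resources.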
what is distributed parallelism (Message Passing Interface): multiple processors working separately, each with its own memory, without having to contend for resources
what's the issue with distributed parallelism: communication between processes is much more difficult
What is a supercomputer: computer cluster made of nodes (connected computers) that work together as a single system
describe the schematic of Midway cluster: workstation (PC) connects to the internet, which talks to the Midway login nodes, which connect to the Midway compute nodes (which do not connect directly to the internet)
What is an OS: operating system is software closest to the computer hardware that manages all hardware and software, abstracting hardware from user programs
What is SSH: secure shell is a cryptographic network protocol for operating network services over an unsecured network; it uses encryption to secure the connection between client and server (used for connecting to a remote supercomputer)
What are the statistics of the Midway3 compute nodes: 192 GB of memory, 100 Gbps network, 24 cores, 3 GHz base frequency
What is the storage of Midway3: 2.2 PB
What is the shell: text-based interface that takes keyboard input as commands and prints the results as text
What is SLURM: workload manager that schedules jobs and manages resources between multiple users
What are the 4 V's of big data:
Volume of data, users, and connected devices
Velocity of data transfers, new users, and new devices
Variety of types of data and whether data is structured or unstructured
Veracity or the accuracy of the data
What is structured data: data organized in a predefined format (fixed fields or a schema) so it can be easily stored, queried, and used with databases
what are examples of structured data: databases, JSON, HTML, CSV, etc.
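note (sketch): structured formats can be read straight into named fields with standard tools. The Python below is a minimal illustration using the standard `csv` and `json` modules; the sample records and function names are made up for the example.

```python
import csv
import io
import json

# Hypothetical sample data, invented for this sketch.
CSV_TEXT = "name,cores\nnode1,24\nnode2,24\n"
JSON_TEXT = '{"name": "node1", "cores": 24}'


def read_csv_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def read_json_record(text):
    """Parse a JSON object into a Python dict."""
    return json.loads(text)


if __name__ == "__main__":
    rows = read_csv_rows(CSV_TEXT)
    print(rows[0]["name"])    # node1
    print(rows[0]["cores"])   # 24 (CSV values are read as strings)
    record = read_json_record(JSON_TEXT)
    print(record["cores"])    # 24 (JSON keeps the number type)
```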
What is unstructured data and what are examples: data that has no predefined structure, e.g. web pages, documents, PDFs, emails, media, sensor data
what are the challenges of unstructured data: it must be scraped, cleaned, pre-processed, edited, integrated, and prepared, with outliers removed, before it can be analyzed
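note (sketch): two of the steps above, pre-processing and outlier removal, can be illustrated on a small list of raw text readings. The Python below is a minimal sketch under stated assumptions: the sample readings, the function name `clean_readings`, and the rule "drop values more than 10 away from the median" are all invented for the example and are not a recommended general method.

```python
import statistics


def clean_readings(raw, max_deviation=10.0):
    """Turn raw text readings into floats and drop values far from the median."""
    # Step 1 (pre-process): strip whitespace, skip entries that are not numbers.
    values = []
    for item in raw:
        try:
            values.append(float(item.strip()))
        except ValueError:
            continue
    if not values:
        return []
    # Step 2 (remove outliers): keep values within max_deviation of the median.
    median = statistics.median(values)
    return [v for v in values if abs(v - median) <= max_deviation]


if __name__ == "__main__":
    raw = [" 20.5", "21.0 ", "n/a", "19.8", "", "999", "20.1"]
    print(clean_readings(raw))  # [20.5, 21.0, 19.8, 20.1]
```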
What are the 4 rules of data trust:
not all data is trustworthy
not all trustworthy data is correct
untrustworthy data is not always incorrect
even if data is correct, answer may be wrong
Why is data visualization necessary: it can help create meaningful interpretations of results