Lecture 5: Data Analysis as Workflows, Lecture 6: Parallel Processing, Lecture 7: Distributed Computing


30 Terms

1. Workflow

Composition of functions

2. Computational workflows

Composition of programs

  • No user interaction during execution

  • No cycles allowed!

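A minimal sketch in Python of this idea, with made-up component names (clean, normalize, summarize): each component is a program with inputs and outputs, and the workflow composes them with no cycles and no user interaction during execution.

```python
# Sketch of a three-step workflow as a composition of functions.
# Each "component" stands in for a program with inputs and outputs;
# the composition forms a directed acyclic graph (no cycles).

def clean(raw):            # component 1: drop missing values
    return [x for x in raw if x is not None]

def normalize(data):       # component 2: rescale to [0, 1]
    lo, hi = min(data), max(data)
    return [(x - lo) / (hi - lo) for x in data]

def summarize(data):       # component 3: compute a final result
    return sum(data) / len(data)

# The workflow is just function composition, run end to end unattended.
result = summarize(normalize(clean([3, None, 7, 1, None, 9])))
print(result)              # 0.5
```
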
3. Workflow Component

Represents a function (or computation) in the workflow that is implemented as a program with inputs, parameters and outputs

4. Electronic Lab Notebooks

A popular alternative to workflow systems: record data, software, results, notes, etc.

  • Records what code was run when generating a result

  • Can re-run with new data

5. Algorithmic Complexity

Linear complexity: When its execution time grows linearly with the size of the input data

Polynomial complexity: When its execution time is bound by a polynomial expression in the size of the input data (e.g., n³)

Exponential complexity: When its execution time is bound by an exponential expression in the size of the input data (e.g., 2ⁿ)

6. Algorithmic Complexity: Big “O” Notation

n: size of the input data

Linear complexity: O(n)

Polynomial complexity: O(nᵏ)

Exponential complexity: O(kⁿ)
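
A quick numeric illustration of how the three classes grow (n is the input size, k a constant; the values are purely illustrative):

```python
# Operation counts for each complexity class at n = 20, k = 2.

n, k = 20, 2
print("linear      O(n):  ", n)        # 20
print("polynomial  O(n^k):", n ** k)   # 400
print("exponential O(k^n):", k ** n)   # 1048576
```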

7. Divide-and-Conquer Strategy

Divide into pieces → solve each piece separately → combine the individual results

*Requires that each piece is independent of the others
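
A sketch of the strategy on a toy problem (summing a list); the two halves are independent of each other, which is exactly what makes the approach parallelizable:

```python
# Divide-and-conquer: split the input into independent pieces,
# solve each piece separately, then combine the individual results.

def dc_sum(xs):
    if len(xs) <= 1:                 # base case: a single piece
        return xs[0] if xs else 0
    mid = len(xs) // 2
    left = dc_sum(xs[:mid])          # solve each piece separately...
    right = dc_sum(xs[mid:])         # ...the pieces are independent
    return left + right              # combine the individual results

print(dc_sum(list(range(10))))       # 45
```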

8. Dependencies and Message Passing

The steps within an algorithm may have significant interdependencies and require exchanging information; such algorithms may not be easily parallelizable
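
A toy example of such a dependency: each loop iteration needs the previous iteration's result, so the steps cannot simply be split among independent workers:

```python
# A loop-carried dependency: step i needs the result of step i - 1,
# so the steps cannot be distributed without passing messages between them.

def running_balance(transactions, start=0.0):
    balance = start
    history = []
    for t in transactions:
        balance = balance + t        # depends on the previous iteration
        history.append(balance)
    return history

print(running_balance([100, -30, 25]))   # [100.0, 70.0, 95.0]
```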

9. Speedup

S = T_sequential / T_parallel
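
For example, a job that takes 120 s sequentially and 30 s in parallel achieves a speedup of S = 120 / 30 = 4.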

10. Critical Path

Consists of consecutive steps that are interdependent and therefore not parallelizable

11. Amdahl’s Law

The theoretical speedup in the execution of a task, where p is the proportion of the task that is parallelizable; with unlimited processors the speedup approaches S = 1/(1 - p)
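
A minimal sketch of the general form of the law, S(p, N) = 1 / ((1 - p) + p/N), which approaches the 1/(1 - p) limit as the number of processors N grows (function name and values are illustrative):

```python
# Amdahl's Law: speedup with N processors when a proportion p of the
# task is parallelizable. For p = 0.9 the limit is 1 / (1 - 0.9) = 10x.

def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

for n in (2, 8, 1024):
    print(n, round(amdahl_speedup(0.9, n), 2))
# 2 1.82
# 8 4.71
# 1024 9.91   <- already close to the 10x ceiling
```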

12. Embarrassingly Parallel

A problem whose pieces are cleanly separable and can be carried out in parallel, typically with significant speedups
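
A minimal sketch using Python's standard multiprocessing module; the simulate task is made up for illustration, and the inputs are fully independent, so no communication between workers is needed:

```python
# Embarrassingly parallel: independent inputs mapped across worker
# processes, with no message passing between the pieces.

from multiprocessing import Pool

def simulate(seed):              # illustrative, self-contained task
    return sum((seed * i) % 7 for i in range(100_000))

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = pool.map(simulate, range(8))   # one task per input
    print(results)
```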

13. Multi-core computing

Several processors (cores) in the same computer

14. Shared memory

All processors access a single, common memory
15. Distributed memory

Each processor has its own private memory; processors exchange data by passing messages over a network
16. Mixed-memory architecture

Combines both: processors within a node share memory, and nodes with separate memories are connected over a network
17. Multi-core chips

Several processor cores placed on a single chip
18. Graphical Processing Units (GPUs)

Processors designed to do many simple computations in parallel to display graphics; they are very cheap

19. Floating-point operations per second (FLOPS)

How the speed of supercomputers is measured
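
For example, a machine that sustains 10¹⁵ floating-point operations per second runs at 1 petaFLOPS.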

20. Distributed computing

A parallel computing paradigm where individual cores do computations that are orchestrated over a network (e.g., the Internet)

21. Web services

An approach to distributed computing where third parties offer services for remote execution that can be orchestrated to create complex applications

22. Grid computing

An approach to distributed computing where the computing power of several computers of different kinds is orchestrated through a central “middleware” control center

23. Cluster computing

An approach to distributed computing where the processing power of several computers of a very similar nature is orchestrated through a central head node

24. Virtual Machines

Frozen versions of all the software on a machine that is needed to run an application, including the OS, programming-language support, libraries, etc.

25. Parallel programming languages

Languages that contain special instructions to use multiple processing and memory units

26. MapReduce and Hadoop

  1. Provide a programming language to implement a divide-and-conquer paradigm for distributing computations

    1. Split (map)

    2. Process

    3. Join (reduce)

  2. Manage execution failures automatically

MapReduce was developed at Google and is proprietary. Hadoop has equivalent functionality but is publicly available (open source)

Set up to run on clusters

Hadoop has created an ecosystem of associated software with useful functionality (Spark, Pig, Hive, etc.)
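
A toy word count written in the split (map) → process → join (reduce) style, sketched in plain Python rather than the actual Hadoop or MapReduce APIs:

```python
# MapReduce-style word count: map emits (key, value) pairs, the shuffle
# groups them by key, and reduce combines each group into a final result.

from collections import defaultdict

def map_phase(document):
    return [(word, 1) for word in document.split()]        # split (map)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)                          # group by key
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}  # join (reduce)

docs = ["to be or not to be", "to parallelize or not"]
pairs = [pair for d in docs for pair in map_phase(d)]      # process each doc
print(reduce_phase(shuffle(pairs)))
# {'to': 3, 'be': 2, 'or': 2, 'not': 2, 'parallelize': 1}
```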

27. Repeatability

The same lab can re-run a (data analysis) method with the same data and get the same results

28. Reproducibility

Another lab can re-run (a data analysis) method with the same data and get the same results

Should always be possible

29. Replicability

Another lab can run the same analysis with the same or different methods or data and get consistent results

A failure to replicate is just as important as a successful replication

30. Generalizability

The results of a study apply in other contexts or applications