Computers
Computing Device: a machine that can run a program, including computers, tablets, servers, routers, and smart sensors
Computing System: a group of computing devices and programs working together for a common purpose
Computing Network: a group of interconnected computing devices capable of sending or receiving data.
Connecting These Devices
Router: A type of computer that forwards data across a network
Path: the series of connections between computing devices on a network starting with a sender and ending with a receiver.
Redundancy: the inclusion of extra components so that a system can continue to work even if individual components fail, for example by having more than one path between any two connected devices in a network.
Fault Tolerant: Can continue to function even in the event of individual component failures. This is important because elements of complex systems like a computer network fail at unexpected times, often in groups.
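Example (a rough sketch, not a real routing algorithm): the short Python program below checks whether data can still get from a sender to a receiver after one connection fails. The device names A through D and the link lists are made up for illustration.

```python
from collections import deque

def is_reachable(links, start, goal):
    """Breadth-first search: can data travel from start to goal over these links?"""
    neighbors = {}
    for a, b in links:
        # Links work in both directions.
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        device = queue.popleft()
        if device == goal:
            return True
        for nxt in neighbors.get(device, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# Hypothetical network with two paths from A to D: A-B-D and A-C-D.
redundant_links = [("A", "B"), ("B", "D"), ("A", "C"), ("C", "D")]
# The same network after the B-D connection fails.
after_failure = [link for link in redundant_links if link != ("B", "D")]
# A network that only ever had the path A-B-D, after the same failure.
single_path_after_failure = [("A", "B")]

print(is_reachable(after_failure, "A", "D"))              # True
print(is_reachable(single_path_after_failure, "A", "D"))  # False
```

The redundant network is fault tolerant for this failure because a second path (A-C-D) still exists.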
Bandwidth: the maximum amount of data that can be sent in a fixed amount of time, usually measured in bits per second.
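Example: bandwidth puts a lower limit on how long a transfer takes. The sketch below assumes an idealized connection with no delays or lost packets; the file size and bandwidth are made-up numbers.

```python
def min_transfer_seconds(size_bytes, bandwidth_bits_per_second):
    """Shortest possible transfer time: total bits divided by bits per second."""
    return size_bytes * 8 / bandwidth_bits_per_second

# A 5,000,000-byte file over a 10,000,000 bits-per-second connection.
print(min_transfer_seconds(5_000_000, 10_000_000))  # 4.0
```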
Packet: A chunk of data sent over a network. Larger messages are divided into packets that may arrive at the destination in order, out-of-order, or not at all.
Packet Metadata: Data added to all packets to help route them through the network and potentially reassemble the original message.
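Example (a simplified sketch, not how real network software is written): the Python below splits a message into packets, attaches a sequence number and a total count as metadata, shuffles them to imitate out-of-order arrival, and reassembles the message. The function names and the "seq" / "total" fields are assumptions for this example.

```python
import random

def to_packets(message, size):
    """Split a message into packets tagged with metadata (sequence number, total count)."""
    chunks = [message[i:i + size] for i in range(0, len(message), size)]
    return [{"seq": n, "total": len(chunks), "data": chunk}
            for n, chunk in enumerate(chunks)]

def reassemble(packets):
    """Rebuild the message from packets in any order; return None if any are missing."""
    if not packets:
        return None
    total = packets[0]["total"]
    by_seq = {p["seq"]: p["data"] for p in packets}
    if len(by_seq) != total:
        return None  # at least one packet never arrived
    return "".join(by_seq[n] for n in range(total))

packets = to_packets("Hello, Internet!", 4)
random.shuffle(packets)         # packets may arrive out of order
print(reassemble(packets))      # Hello, Internet!
print(reassemble(packets[1:]))  # None, because one packet was dropped
```

The metadata is what makes reassembly possible: without sequence numbers the receiver could not tell the correct order or notice a missing packet.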
Datastream: Information passed through the Internet in packets.
Scalability: the capacity for the system to change in size and scale to meet new demands
Protocols
Protocol: An agreed-upon set of rules that specify the behavior of some system
Internet Protocol (IP): a protocol for sending data across the Internet that assigns unique numbers (IP addresses) to each connected device
User Datagram Protocol (UDP): A protocol for sending packets quickly with minimal error-checking and no resending of dropped packets
Transmission Control Protocol (TCP): A protocol for sending packets that does error-checking to ensure all packets are received and properly ordered
Hypertext Transfer Protocol (HTTP): a protocol for computers to request and share the pages that make up the World Wide Web on the Internet
Domain Name System (DNS): the system responsible for translating domain names like example.com into IP addresses
Digital Divide
Differing access to computing devices and the Internet, based on socioeconomic, geographic, or demographic characteristics.
Can affect both individuals and groups.
Raises ethical concerns of equity, access, and influence globally and locally.
Affected by the actions of individuals, organizations, and governments.
User Interface: the inputs and outputs that allow a user to interact with a piece of software. User interfaces can include a variety of forms such as buttons, menus, images, text, and graphics.
Input: data that are sent to a computer for processing by a program. Can come in a variety of forms, such as tactile interaction, audio, visuals, or text.
Output: any data that are sent from a program to a device. Can come in a variety of forms, such as tactile interaction, audio, visuals, or text.
Program Statement: a command or instruction. Sometimes also referred to as a code statement.
Program: a collection of program statements. Programs run (or “execute”) one command at a time.
Sequential Programming: program statements run in order, from top to bottom.
No user interaction
Code runs the same way every time
Event Driven Programming: some program statements run when triggered by an event, like a mouse click or a key press
Programs run differently each time depending on user interactions
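Example (a minimal sketch): the Python below contrasts the two styles. Real event-driven programs get events from a user interface; here the events are just a made-up list of strings, and the handler names are assumptions.

```python
def sequential_program():
    """Statements run top to bottom, so every run gives the same result."""
    steps = []
    steps.append("load data")
    steps.append("process data")
    steps.append("show result")
    return steps

def event_driven_program(events):
    """Handlers run only when their event happens, in the order the events arrive."""
    handlers = {
        "click": lambda: "button clicked",
        "keypress": lambda: "key pressed",
    }
    outputs = []
    for event in events:
        handler = handlers.get(event)
        if handler is not None:  # ignore events that have no handler
            outputs.append(handler())
    return outputs

print(sequential_program())
print(event_driven_program(["click", "keypress", "click"]))
print(event_driven_program(["keypress"]))
```

The sequential program always returns the same three steps, while the event-driven program's output depends on which events it receives.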
Debugging Strategies
Keep your code clean
Run your code
Use classmates and resources
Documentation: a written description of how a command or piece of code works or was developed.
Comment: a form of program documentation written into the program to be read by people. Comments do not affect how a program runs.
Pair Programming: a collaborative programming style in which two programmers switch between the roles of writing code and tracking or planning high level progress
Correlation does not equal Causation
Metadata: data about data
Visualizations can help us:
Answer questions
Look at lots of data at once
See patterns that are "invisible" if you just look at the table
When does data need to be cleaned?
Data is incomplete
Data is invalid
Multiple tables are combined into one
What leads to "messy" data?
Users enter in different types of data ("two", 2)
Users use different abbreviations to represent the same information ("February", "Feb", "Febr")
Data may have different spellings ("color", "colour") or inconsistent capitalization ("spring", "Spring")
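Example (a small sketch of one way to clean the kinds of messy values listed above): the lookup tables and function names are assumptions for this example and only cover a few cases.

```python
# Hypothetical lookup tables; a real dataset would need many more entries.
MONTHS = {"feb": "February", "febr": "February", "february": "February"}
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3}

def clean_month(value):
    """Map different abbreviations and capitalizations to one spelling."""
    key = value.strip().rstrip(".").lower()
    return MONTHS.get(key, value.strip())

def clean_number(value):
    """Turn "two", "2", or 2 into the integer 2; return None if unrecognized."""
    if isinstance(value, int):
        return value
    text = str(value).strip().lower()
    if text.isdigit():
        return int(text)
    return NUMBER_WORDS.get(text)

print([clean_month(m) for m in ["February", "Feb", "Febr"]])
print([clean_number(n) for n in ["two", 2, "2"]])  # [2, 2, 2]
```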
Filtering data allows the user to look at a subset of the data.
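Example (a minimal sketch): filtering in Python can be as simple as keeping the rows that match a condition. The rows and column names below are made up for illustration.

```python
# Hypothetical survey rows; the column names are invented for this example.
rows = [
    {"name": "Ada", "grade": 9, "pet": "cat"},
    {"name": "Ben", "grade": 10, "pet": "dog"},
    {"name": "Cy", "grade": 9, "pet": "dog"},
]

def filter_rows(rows, column, value):
    """Return only the rows whose column equals value."""
    return [row for row in rows if row[column] == value]

ninth_graders = filter_rows(rows, "grade", 9)
print([row["name"] for row in ninth_graders])  # ['Ada', 'Cy']
```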
Bar Chart: Count how many times each value in the column appears and make a bar at that height.
Information we can get out of bar charts:
What value(s) are most common in this column?
What value(s) are least common in this column?
What is the unique list of values in this column?
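Example (a sketch using made-up data): the counting behind a bar chart can be done with Python's collections.Counter, and the three questions above can be answered from the counts.

```python
from collections import Counter

# Hypothetical column of values.
colors = ["red", "blue", "red", "green", "blue", "red"]
counts = Counter(colors)

most_common = counts.most_common(1)[0][0]  # value with the highest count
least_common = min(counts, key=counts.get) # value with the lowest count
unique_values = sorted(counts)             # each value listed once

print(most_common)    # red
print(least_common)   # green
print(unique_values)  # ['blue', 'green', 'red']

# A text version of the bar chart: one '#' per appearance.
for value, count in counts.most_common():
    print(f"{value:<6}{'#' * count}")
```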
Histogram: Similar to a bar chart, but first all numbers in a range or "bucket" are grouped together. For example, with a bucket size of 20, the numbers 41, 48, and 53 would all be placed in the same bucket between 40 and 60.
Information we can get out of histograms:
What range of value(s) are most common in this column?
What range of value(s) are least common in this column?
What ranges of values do or do not appear in this column?
Histograms can only be created with numeric data but can be useful when a normal bar chart may be difficult to read.
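Example (a sketch with made-up numbers): the bucketing step of a histogram can be written in Python by rounding each number down to the start of its bucket. The function name bucket_counts is an assumption for this example.

```python
from collections import Counter

def bucket_counts(numbers, bucket_size):
    """Count how many numbers fall in each bucket, keyed by the bucket's starting value."""
    counts = Counter((n // bucket_size) * bucket_size for n in numbers)
    return dict(sorted(counts.items()))

values = [41, 48, 53, 12, 75, 61]
print(bucket_counts(values, 20))  # {0: 1, 40: 3, 60: 2}
```

With a bucket size of 20, the numbers 41, 48, and 53 all land in the bucket starting at 40. No bucket starting at 20 appears in the result, which shows that no values fall between 20 and 40.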
Cross Tab: Counts how often pairs of values in two columns appear.
Information we can get out of cross tab charts:
Finding the most / least common combinations of values in two columns
Finding patterns across two columns
Exploring two columns when one or both are strings.
Not useful if either column has too many values because the chart would be enormous
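Example (a sketch with made-up data): a cross tab counts pairs of values, which in Python can be done by counting tuples. The two columns here (pet and where it lives) are invented for illustration.

```python
from collections import Counter

# Hypothetical rows: (pet, lives) pairs taken from two columns.
rows = [
    ("cat", "indoor"),
    ("dog", "outdoor"),
    ("cat", "indoor"),
    ("dog", "indoor"),
    ("cat", "outdoor"),
]
pair_counts = Counter(rows)

print(pair_counts.most_common(1))  # [(('cat', 'indoor'), 2)]

# Print the counts as a small table: one row per pet, one column per location.
pets = sorted({pet for pet, _ in rows})
places = sorted({place for _, place in rows})
print("pet   " + "".join(f"{place:<9}" for place in places))
for pet in pets:
    print(f"{pet:<6}" + "".join(f"{pair_counts[(pet, place)]:<9}" for place in places))
```

The table has one cell for every combination of values, which is why a cross tab becomes enormous when either column has many different values.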
Scatter Plot: Shows combinations of values from two columns
Information we can get out of scatter plots:
Seeing patterns and trends between two values
Numeric data with lots of different values
Not useful for lots of repeated values
"sharing data with others so they can analyze it"
Open data is publicly available data shared by governments, organizations, and others
Making data open helps spread useful knowledge or creates opportunities for others to use it to solve problems
"collecting data from others so you can analyze it"
Crowdsourcing is the practice of obtaining input or information from a large number of people via the Internet.
Citizen science is research where some of the data collection is done by members of the public using their own computing devices, which helps solve scientific problems
Crowdsourcing offers new models for collaboration, such as connecting businesses or social causes with funding
Both are examples of how human capabilities can be enhanced by collaboration via computing
"Collect huge amounts of data so we can learn even more from it"
The size of the datasets we analyze affects how much information can be extracted
As a result, in business, science, and many other contexts people are working with increasingly big data sets
When data gets too big it can no longer be processed on one computer. Cloud computing or parallel systems are sometimes used to help process all that information.
In general, the scalability of your system is important to consider when working with big data. You want your system to keep working even as you use more and more data.
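Example (a toy sketch of the idea behind parallel processing, not a real big-data system): the Python below splits a dataset into chunks, hands each chunk to a separate worker, and combines the partial results. It uses threads on one computer as a stand-in for many machines; the function names, chunk size, and worker count are assumptions, and threads in standard Python will not actually speed up this kind of arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(data, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def parallel_sum(data, chunk_size, workers=4):
    """Sum each chunk in a separate worker, then combine the partial sums."""
    chunks = chunked(data, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)

data = list(range(1, 1001))
print(parallel_sum(data, 100))  # 500500
```

Because each chunk is processed independently, more workers (or more computers) can be added as the data grows, which is the kind of scalability described above.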