crowdsourcing
tapping into the collective intelligence of a large group of people to achieve a specific goal or solve a problem
citizen science
type of scientific research that is conducted by distributed individuals who contribute relevant data to research using their own computer devices
examples of data visualization
charles minard’s map of napoleon’s 1812-13 russian campaign
john snow’s mapping of cholera outbreaks in 1854
common misrepresentation in data
correlation does NOT mean causation
simpson’s paradox
groups of data individually trend in one direction but when combined, the trend disappears or is reversed
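A minimal sketch of the reversal, using made-up counts (the group names, options "A" and "B", and every number are hypothetical, chosen only to make the effect visible). Option A has the better success rate inside each group, yet option B looks better once the groups are pooled, because the two options were tried on very different mixes of groups.

```python
# Hypothetical (successes, trials) counts, invented purely for illustration.
groups = {
    "small": {"A": (9, 10), "B": (80, 100)},
    "large": {"A": (30, 100), "B": (2, 10)},
}


def rate(successes, trials):
    """Success rate as a fraction between 0 and 1."""
    return successes / trials


# Within each group, find the option with the higher success rate.
per_group_winner = {
    name: max(options, key=lambda option: rate(*options[option]))
    for name, options in groups.items()
}

# Pool the groups by summing successes and trials for each option.
combined = {}
for options in groups.values():
    for option, (successes, trials) in options.items():
        prev_s, prev_t = combined.get(option, (0, 0))
        combined[option] = (prev_s + successes, prev_t + trials)

combined_winner = max(combined, key=lambda option: rate(*combined[option]))

print("winner within each group:", per_group_winner)
print("winner after combining:  ", combined_winner)
```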
first electronic spreadsheets
visicalc (apple II) and lotus 1-2-3 (IBM PC)
rows
go from left to right, labeled using numbers (1…)
columns
go from top to bottom, labeled using letters (a…)
cell
each individual piece of grid (labeled with letter-number)
label
text that describes some part of the spreadsheet
constant
any number the user enters into the spreadsheet
formula
equation that can perform calculations on existing cells
formatting
makes data more visually appealing
right-click a cell and select format cells, or check under the home tab
conditional formatting
highlights cells that meet specific criteria (= a value, > a value, top __), can also turn cells into mini graphs
auto formatting
pre-made templates to change spreadsheet
select cells, select formatting, format as table
database
organized collection of data stored in tables
data is consistent
consistency in databases
info in one table does not contradict info in any other table
idempotency
an operation will produce the same end result no matter how many times it’s performed
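A small sketch of the difference, assuming a hypothetical account stored as a plain dictionary (the function names are illustrative, not from any library). Setting a value is idempotent because repeating it changes nothing further; adding to a value is not.

```python
def set_balance(account, amount):
    """Idempotent: repeating the call leaves the same end state."""
    account["balance"] = amount


def add_to_balance(account, amount):
    """Not idempotent: every repeat changes the end state again."""
    account["balance"] += amount


idempotent_account = {"balance": 0}
set_balance(idempotent_account, 50)
set_balance(idempotent_account, 50)  # repeated, no further effect

non_idempotent_account = {"balance": 0}
add_to_balance(non_idempotent_account, 50)
add_to_balance(non_idempotent_account, 50)  # repeated, balance grows again

print(idempotent_account, non_idempotent_account)
```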
write-ahead logging
all changes are written and saved to a log before they’re applied to the database; every component of a transaction must be carried out before the transaction is considered complete
atomic transactions
transactions that cannot be broken down while being executed: either every step happens or none does
rollback
returning to the state before the write-ahead log began
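A rough sketch of an atomic transaction with rollback, using Python's built-in `sqlite3` module and an in-memory database. The `accounts` table, the names in it, and the `transfer` helper are all hypothetical. The example shows the all-or-nothing behavior only; it does not expose the log itself, and the logging strategy used underneath is the database's own choice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: the CHECK constraint forbids negative balances.
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 20)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money as one transaction: both updates apply, or neither does."""
    try:
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
        )
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        # The second update broke the CHECK constraint, so undo the first too.
        conn.rollback()
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


failed = transfer(conn, "bob", "alice", 50)   # bob only has 20
after_failed = balances(conn)
succeeded = transfer(conn, "alice", "bob", 30)
after_success = balances(conn)
print(failed, after_failed, succeeded, after_success)
```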
deadlock
when two transactions each hold a lock the other needs (e.g. on the same rows), so neither can continue until the other is complete
one of the transactions must be rolled back
two-phase commit protocol
standardized way to make sure all data can be written without any inconsistencies
first phase: check to see if all processes can be completed
second: if they can be written without issue, then the processes will be committed
if not, the whole transaction is rolled back
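The two phases can be sketched as a toy simulation. The `Participant` class and the `two_phase_commit` function below are hypothetical stand-ins written for this illustration; a real protocol also has to handle timeouts, crashes, and recovery, which are left out here.

```python
class Participant:
    """Hypothetical stand-in for one database taking part in a transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1: report whether this participant could write its changes.
        self.state = "prepared" if self.can_commit else "refused"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled back"


def two_phase_commit(participants):
    """Return True if everyone committed, False if everyone rolled back."""
    # Phase 1: ask every participant whether it can complete its part.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if all said yes; otherwise roll everyone back.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False


healthy = [Participant("orders"), Participant("payments")]
healthy_result = two_phase_commit(healthy)

broken = [Participant("orders"), Participant("payments", can_commit=False)]
broken_result = two_phase_commit(broken)

print(healthy_result, [p.state for p in healthy])
print(broken_result, [p.state for p in broken])
```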
relational database
has multiple tables that are connected or related through unique keys; a key is a column holding a unique value that distinguishes each record from all others
virtual table
temporary tables made up of parts of other tables that help to reduce redundant data
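A sketch of the idea, assuming "virtual table" here means what SQL calls a view (that reading is an assumption, as are the table contents and the view name `class_of_2026`). The view stores no rows of its own; it is recomputed from `students` each time it is queried, so new data shows up without being copied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE students (name TEXT, class_year INTEGER);
    INSERT INTO students VALUES ('Tim', 2026), ('Tom', 2025), ('Wanda', 2026);
    -- The view holds a query, not a second copy of the data.
    CREATE VIEW class_of_2026 AS
        SELECT name FROM students WHERE class_year = 2026;
    """
)


def view_names(conn):
    return [row[0] for row in conn.execute("SELECT name FROM class_of_2026 ORDER BY name")]


before = view_names(conn)
conn.execute("INSERT INTO students VALUES ('Will', 2026)")
after = view_names(conn)  # the new student appears without touching the view
print(before, after)
```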
SQL (structured query language)
language used to manage, access, and manipulate relational databases
ignores extra white space, and keywords are not case-sensitive
SELECT class_year FROM students;
returns every student’s class year, including duplicates
distinct
only lists unique values
order by
sort the data
desc/asc
gives descending/ascending order
limit
limits to a certain # of rows
using * specifies all columns
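The keywords above can be tried out with Python's built-in `sqlite3` module. The `students` table and `class_year` column come from the query in these notes; the `name` column and all rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, class_year INTEGER)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [("Tim", 2026), ("Tom", 2025), ("Wanda", 2026), ("Will", 2027)],
)

# Every student's class year, duplicates included.
all_years = [row[0] for row in conn.execute("SELECT class_year FROM students;")]

# DISTINCT drops duplicates; ORDER BY ... DESC sorts from largest to smallest.
distinct_desc = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT class_year FROM students ORDER BY class_year DESC;"
    )
]

# * selects all columns; LIMIT keeps only the first two rows after sorting.
first_two = conn.execute(
    "SELECT * FROM students ORDER BY name ASC LIMIT 2;"
).fetchall()

print(all_years, distinct_desc, first_two)
```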
where
used to get records that match specific criteria
can be used in conjunction with and/or/not
like %
% represents 0 or more unknown characters
LIKE 'W%' would return entries of any length that start with capital W
LIKE '%ing' would return only entries ending in 'ing'
like _
_ represents exactly one character
LIKE 'T_m' would return Tim, Tom, Tum
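A quick way to check the WHERE and LIKE patterns above, again with Python's built-in `sqlite3` module and a made-up `students` table (the names and years are hypothetical). One caveat: whether LIKE treats upper and lower case as different depends on the database; SQLite's default LIKE ignores case for ASCII letters, so the sample data avoids lowercase look-alikes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, class_year INTEGER)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [("Tim", 2026), ("Tom", 2025), ("Thom", 2027),
     ("Wanda", 2026), ("Will", 2027), ("Sterling", 2025)],
)


def names(sql):
    """Run a query and return the first column of every row."""
    return [row[0] for row in conn.execute(sql)]


# % stands for zero or more characters.
starts_with_w = names("SELECT name FROM students WHERE name LIKE 'W%' ORDER BY name;")
ends_in_ing = names("SELECT name FROM students WHERE name LIKE '%ing' ORDER BY name;")

# _ stands for exactly one character, so the four-letter 'Thom' is excluded.
t_blank_m = names("SELECT name FROM students WHERE name LIKE 'T_m' ORDER BY name;")

# WHERE conditions can be combined with AND / OR / NOT.
w_in_2026 = names(
    "SELECT name FROM students WHERE name LIKE 'W%' AND class_year = 2026;"
)

print(starts_with_w, ends_in_ing, t_blank_m, w_in_2026)
```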
join
combines entries from two or more tables
using ON specifies how the tables being joined are related
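A minimal JOIN sketch with Python's built-in `sqlite3` module. Both tables, their columns, and their rows are hypothetical; the point is that ON names the columns that relate the two tables (here a student's unique `id` and the matching `student_id`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE enrollments (student_id INTEGER, course TEXT);
    INSERT INTO students VALUES (1, 'Tim'), (2, 'Tom');
    INSERT INTO enrollments VALUES
        (1, 'Biology'), (1, 'Chemistry'), (2, 'Biology');
    """
)

# ON says which column in each table must match for rows to be combined.
rows = conn.execute(
    """
    SELECT students.name, enrollments.course
    FROM students
    JOIN enrollments ON enrollments.student_id = students.id
    ORDER BY students.name, enrollments.course;
    """
).fetchall()

for name, course in rows:
    print(name, course)
```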
fault-tolerance
ability of a system to continue to run properly even if one piece fails
big data
sets of data that are larger than a consumer software application can handle