
Big Idea 2: Data and Information - Detailed Notes

Big Idea 2: Data and Information

  • "Data and information is the oil of the 21st century, and analytics is the combustion engine." - Peter Sondergaard, Gartner

Chapter Goals

  • Using bits to represent data

  • Abstractions

  • Analog vs. digital data

  • Consequences of using bits

  • Number systems

  • Converting numbers

  • Overflow errors

  • Roundoff errors

  • Data compression

  • Information from data

  • Predicting algorithms

  • Visualization of data

  • Privacy concerns

  • Metadata

Bits Represent Data

  • A bit is a single binary digit and is either 0 or 1.

  • A byte is composed of 8 bits.

  • Binary sequences can represent all digital data including colors, Boolean logic, and lists.

  • Anything stored on a computer can be represented by binary sequences.

  • Some data take many bits to represent.

    • Example: A 10 MP (10-megapixel) picture in 16-bit mode contains 10,000,000 pixels.

    • Each pixel contains 6 bytes ( 6 * 8 = 48 bits).

    • Total bits in the picture: 8 * 6 * 10,000,000 = 480,000,000 bits.

  • Videos commonly use 1,000,000 bits per second.
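
The picture arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the variable names are illustrative, and the split of 6 bytes per pixel into three 16-bit color channels is an assumption that the notes do not state.

```python
# Sketch of the 10 MP picture example; names are illustrative.
PIXELS = 10_000_000
BITS_PER_CHANNEL = 16   # "16-bit mode"
CHANNELS = 3            # assumption: red, green, and blue channels

bits_per_pixel = BITS_PER_CHANNEL * CHANNELS   # 48 bits
bytes_per_pixel = bits_per_pixel // 8          # 6 bytes
total_bits = PIXELS * bytes_per_pixel * 8      # 8 * 6 * 10,000,000

print(bytes_per_pixel, "bytes per pixel")
print(f"{total_bits:,} bits in the picture")
```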

Abstractions

  • Bits are grouped to represent abstractions like numbers, characters, and colors.

  • Abstractions find common features to generalize the program.

  • This can help shrink the code if you are planning to use a method/procedure more than once in a program.

  • Instead of repeating the code lines, you can reference a prior set of directions to repeat the outcome without having to rewrite the lines of code.

  • By reducing the number of lines of code, chances for errors are also reduced.

Examples
  • Java: Adding the numbers 1,234 + 4,321 would look like: int x = 1234 + 4321;

  • Python: Adding the numbers 1,234 + 4,321 would look like: x = 1234 + 4321

  • Without abstractions (machine code), the same math example would be a complex binary sequence.

  • High-level languages contain the most abstractions, allowing for easier coding and debugging.

  • Abstractions will be covered in more detail in the programming chapter.
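
The idea of referencing a prior set of directions instead of repeating code can be sketched with a small procedure. The function name and the sample scores below are hypothetical and only illustrate reuse.

```python
def average(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

# The same procedure is reused three times instead of rewriting the math.
quiz_average = average([80, 90, 100])
lab_average = average([70, 75])
exam_average = average([88])

print(quiz_average, lab_average, exam_average)
```

If the way an average is computed ever needs to change, only the body of the procedure has to be edited, which is one reason fewer lines tend to mean fewer chances for error.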

Analog vs. Digital Data

  • An analog signal has values that change smoothly over time, rather than in discrete intervals.

    • Examples: pitch, volume of music, colors in a painting, position of a sprinter.

  • Analog signals are continuous signals, while digital signals are discrete time signals.

  • A digital signal is an analog signal that has been broken up into steps.

  • Analog data can be approximated by digital data using a sampling technique.

  • Sampling: measuring values of the analog signal at regular intervals (samples).

  • Samples are measured to figure out the exact bits required to store each sample.

  • The smaller the interval between samples (that is, the higher the sample rate), the more accurately the digital signal represents the analog signal.

  • The use of digital data to approximate real-world analog data is an example of an abstraction.
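
Sampling can be sketched in Python by measuring a signal at regular intervals. The 1 Hz sine wave, the two sample rates, and the helper names are assumptions chosen for illustration. The error measure holds each sample's value until the next sample arrives and compares it with the true signal.

```python
import math

def tone(t):
    """Hypothetical analog signal: a 1 Hz sine wave."""
    return math.sin(2 * math.pi * t)

def sample(signal, duration, rate):
    """Measure the signal at regular intervals (rate samples per second)."""
    count = int(duration * rate)
    return [signal(i / rate) for i in range(count)]

def max_hold_error(signal, duration, rate, checks=1000):
    """Largest gap between the signal and the most recent sample."""
    worst = 0.0
    for k in range(checks):
        t = k * duration / checks
        held = signal(math.floor(t * rate) / rate)  # value of the latest sample
        worst = max(worst, abs(signal(t) - held))
    return worst

coarse = sample(tone, 1.0, 4)      # 4 samples per second
fine = sample(tone, 1.0, 100)      # 100 samples per second
coarse_error = max_hold_error(tone, 1.0, 4)
fine_error = max_hold_error(tone, 1.0, 100)
print(len(coarse), len(fine), coarse_error > fine_error)
```

Sampling more often leaves a smaller gap between the stored values and the original signal, at the cost of storing more samples.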

Consequences of Using Bits to Represent Data

  • A variable is an abstraction inside a program that can hold a value.

  • Each variable has associated data storage that represents one value at a time.

  • The value can be a list or other collection that, in turn, contains multiple values.

  • Data types include integers, real numbers, Boolean, string, and list.

    • Integer: 4

    • Real Number: 4.00

    • Boolean: True or False

    • String: “Novack the third”

    • List: [1, 1, 35, 6]

  • In many programming languages, integers are represented by a fixed number of bits which limits the range of integer values and mathematical operations on those values.

  • For example, in Java, the range of the value of an integer is from -2,147,483,648 to +2,147,483,647.

  • Trying to store a number bigger than the limits will result in an overflow error.

  • Some languages like Python do not have limits on number size but, instead, expand to the limit of the available memory.

  • The language used on the AP exam, similar to Python, does not have a limit on the size of numbers, but is limited only by the size of the computer’s memory.

  • The test will expect you to know that some computer languages do have limits on the size of data types.
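
The effect of a fixed number of bits can be sketched in Python. Python integers grow as needed, so the helper below (a hypothetical name) only simulates a 32-bit signed integer. It assumes two's complement wrap-around, which is one common result of overflow; other languages may report an error instead.

```python
INT_MIN = -2**31       # -2,147,483,648
INT_MAX = 2**31 - 1    # +2,147,483,647

def wrap_int32(value):
    """Simulate storing value in 32 bits (assumes two's complement wrap-around)."""
    return (value - INT_MIN) % 2**32 + INT_MIN

unlimited = INT_MAX + 1            # Python simply uses more memory
wrapped = wrap_int32(INT_MAX + 1)  # a 32-bit integer cannot hold this value

print(unlimited)
print(wrapped)
```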

Number Systems

  • Number bases, including binary, decimal, and hexadecimal, are used to represent and investigate digital data.

  • On your AP exam, you will be expected to convert binary to decimal and decimal to binary only.

Decimal    Binary
0          0000
1          0001
2          0010
3          0011
4          0100
5          0101
6          0110
7          0111
8          1000
9          1001
10         1010
11         1011
12         1100
13         1101
14         1110
15         1111

Converting Numbers into Different Bases
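
One way to convert between decimal and binary can be sketched as follows. The function names are illustrative, and the sketch assumes non-negative whole numbers only. Decimal to binary uses repeated division by 2, and binary to decimal doubles a running total and adds each bit.

```python
def decimal_to_binary(n):
    """Divide by 2 repeatedly; the remainders, read in reverse, are the bits."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        digits.append(str(n % 2))  # remainder is the next bit (right to left)
        n //= 2
    return "".join(reversed(digits))

def binary_to_decimal(bits):
    """Read bits left to right, doubling the total and adding each bit."""
    total = 0
    for bit in bits:
        total = total * 2 + int(bit)
    return total

print(decimal_to_binary(13))      # 1101
print(binary_to_decimal("1010"))  # 10
```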

Roundoff Errors
  • 1/3 does not always equal 1/3.

  • A roundoff error occurs when decimals (real numbers) are rounded.

  • One computer might calculate 1/3 as 0.333333. Another computer might calculate 1/3 as 0.3333333333.

  • In this case, 1/3 on one computer is not equal to 1/3 on a second computer.
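
Roundoff can be seen directly in Python. The sketch below imitates the two computers by rounding 1/3 to different numbers of digits, and it also shows a well-known case where a decimal sum is not stored exactly.

```python
import math

# Two "computers" that keep different numbers of digits.
third_a = round(1 / 3, 6)    # 0.333333
third_b = round(1 / 3, 10)   # 0.3333333333
same_value = (third_a == third_b)

# 0.1, 0.2, and 0.3 cannot be stored exactly in binary floating point.
sum_is_exact = (0.1 + 0.2 == 0.3)
sum_is_close = math.isclose(0.1 + 0.2, 0.3)

print(same_value, sum_is_exact, sum_is_close)  # False False True
```

Because of this, comparing real numbers with a tolerance (as math.isclose does) is usually safer than testing for exact equality.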

Lossy and Lossless Data Compression

  • Data compression is reducing the size (number of bits) of transmitted or stored data.

  • Fewer bits do not necessarily mean less information.

  • Digital data compression often involves trade-offs in quality versus storage requirements.

  • Lossy compression can significantly reduce the file size while decreasing resolution.

  • Traditionally, lossy compression is used to reduce file size for storage and transmission (email).

  • The trade-off of using lossy data is that you will not recover the original file. Some data will be lost.

  • In lossless data compression, no data are lost. After compression, the original file can be reproduced without any lost data.

  • The trade-off of lossless data compression is larger files that can be difficult to store, transfer, and handle.

  • In situations where quality or the ability to reconstruct the original file is important, lossless compression algorithms are typically chosen.

  • In situations where minimizing data size or transmission time is important, lossy compression algorithms are typically chosen.
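
The difference between the two kinds of compression can be sketched in Python. The lossless part uses the standard zlib module. The lossy part is a simplified stand-in, not a real image or audio codec: it keeps only the top 4 bits of each hypothetical 8-bit sample.

```python
import zlib

# Lossless: the original bytes come back exactly.
original = b"AAAAAAAAAABBBBBBBBBB" * 50
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)
print(len(original), len(compressed), restored == original)

# Lossy (simplified): store each 8-bit sample in 4 bits by dropping detail.
samples = [3, 18, 35, 200, 255]             # hypothetical 8-bit values
quantized = [s // 16 for s in samples]      # 4 bits each, half the storage
approximate = [q * 16 for q in quantized]   # best reconstruction available
print(approximate, approximate == samples)
```

The lossless round trip reproduces every byte, while the lossy version saves space but can only return an approximation of the original samples.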

Information Extracted from Data

  • People generate significant amounts of digital data daily.

  • Some always-on devices are collecting geographic location data constantly, while social media sites are collecting premium data based on your usage.

  • People can use computer programs to process information as well as to gain insight and knowledge.

  • Information is the collection of facts and patterns extracted from data.

  • Gaining insight from this valuable data involves a combination of statistics, mathematics, programming, and problem solving.

  • Large data sets may be analyzed computationally to reveal patterns, trends, and associations. These trends are powerful predictors of future behaviors.

  • Investors are constantly reviewing trends in past pricing to influence their future investment decisions. However, sometimes trends can be misinterpreted and result in business disasters.

  • Digitally processed data may show correlation between variables. A correlation found in data does not necessarily indicate that a causal relationship exists. Additional research is needed to understand the exact nature of the relationship.

  • Often, the size of the data set affects the amount of information that can be extracted from it. A single source often does not contain the data needed to draw a conclusion. Combining data from a variety of sources may be necessary to formulate a conclusion.

  • Depending on how the data were collected, the information may not be uniform. For example, if users entered data into an open field, the way they chose to abbreviate, spell, or capitalize something may vary from user to user.

  • Data sets pose challenges regardless of size, such as:

    • The need to clean data

    • Incomplete data

    • Invalid data

    • The need to combine data sources

  • Cleaning data is a process that makes the data uniform without changing their meaning. One example is replacing all equivalent abbreviations with the same word. This can also be done with various spellings and with different capitalizations.

  • Data can get too large for traditional data-processing applications. The ability to process data depends on the capabilities of the users and their tools. Social media activity generates an enormous amount of data. In the absence of a data-processing application, much of this data will go unexamined. All of the information in the data is too large to examine by hand in real time.

  • Some data sets are difficult to process using a single computer and may require parallel systems. Parallel systems are fully covered in Chapter 5, “Big Idea 4: Computer Systems and Networks.”

  • Problems of bias are often created by the types and sources of data being collected. Bias is not eliminated by simply collecting more data. A large amount of data is generated by humans. Algorithms that use this data will reflect this bias.

  • Despite the advantages of big data, a large sample size can magnify the bias associated with the data being used. Data can have little value if the sample is not representative of the population to which the results will be generalized. Computing bias is covered more completely in Chapter 6, “Big Idea 5: Impact of Computing.”
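
The data-cleaning step described in this section (making abbreviations, spelling, and capitalization uniform without changing meaning) can be sketched as follows. The abbreviation table and the street names are hypothetical examples.

```python
# Hypothetical table of equivalent abbreviations.
ABBREVIATIONS = {
    "st": "street",
    "st.": "street",
    "ave": "avenue",
    "ave.": "avenue",
}

def clean(entry):
    """Trim spaces, lowercase, and replace abbreviations with one word."""
    words = entry.strip().lower().split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)

raw_entries = ["Main St.", "main street", "  MAIN ST "]
cleaned = [clean(entry) for entry in raw_entries]
print(cleaned)
```

After cleaning, the three entries are identical, so a program counting addresses would treat them as the same place.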

Predicting Algorithms

  • Predicting algorithms use information collected from big data to influence our daily lives.

    • A credit card company can use purchasing patterns to identify when to extend credit or flag a purchase for possible fraud.

    • Social media sites can use patterns to target advertising based on viewing habits.

    • An online store analyzing customers’ past purchases can suggest new products the customer may be interested in buying.

    • An entertainment application may recommend an additional movie to watch based on the viewer’s interests.

    • Algorithms can be used to prevent crimes by identifying crime “hot spots.” The police can then step up patrols in those areas.

Visualization of Data

  • Using appropriate visualizations when presenting digitally processed data can help one gain insight and knowledge.

  • Although big data is a powerful tool, the data will lose their value if they cannot be presented in a way that can be interpreted.

  • Visualization tools can communicate information about data.

  • Column charts, line graphs, pie charts, bar charts, XY charts, radar charts, histograms, and waterfall charts can make complex data easier to interpret.

  • Example: a graph that plots users versus profits.

    • Based on the graph, a company might decide to invest in drawing more members or to spend on advertising to attract new members.

  • A predicted trend is not a guarantee of future usage.

  • A trend cannot predict an innovation that could make the current innovation obsolete.

  • It can be dangerous to draw conclusions based on good data and assume that those conclusions apply across the board or that past patterns will remain consistent.

  • Often, a single source does not contain the data needed to draw a conclusion. It may be necessary to combine data from a variety of sources to formulate a conclusion.

  • Predicting algorithms use historical data to predict future events. These data are used to build a mathematical model that encompasses trends. That predictive model is then used on current data to predict what will happen next.
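
Building a model from historical data and then applying it can be sketched with a straight-line fit. This is a minimal sketch using ordinary least squares. The monthly user counts are made-up numbers, and real predictive models are usually more elaborate.

```python
def fit_line(xs, ys):
    """Least-squares line through the points; returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict(model, x):
    """Apply the fitted trend to a new input."""
    slope, intercept = model
    return slope * x + intercept

months = [1, 2, 3, 4]        # historical data (hypothetical)
users = [10, 20, 30, 40]
model = fit_line(months, users)
next_month = predict(model, 5)
print(model, next_month)
```

The prediction only extends the past trend. As noted above, it cannot account for changes that the historical data never recorded.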

Example Twenty-Four
  • What can be learned from the following data table kept in a pet store?

    • The date when a certain dog food was purchased the greatest number of times

    • The total number of cities in which a certain food was purchased

    • The total number of foods purchased in a certain city during a month

  • E-commerce sites use data to determine how much inventory to hold and how to price products. Additionally, data about product views and purchases power the recommendation engine, which drives a large portion of sales. Data allow for personalized and effective targeted advertising. Sometimes an e-commerce site knows what you want to buy before you do.

Example Twenty-Five
  • A high school principal is interested in predicting the number of students passing a state-level exam. She created a computer model that uses data from third-party software showing an increasing student pass rate for the exam. The model provided by the software company predicts a 90% student pass rate. The actual percentage of students passing the state exam was 74%.

  • When creating a model, not all real-world variables can be represented. In the case of a model not accurately predicting outcomes, additional information can be added to make a more accurate prediction.

  • What are some possible additions to the model to make it more reliable in predicting the student pass rate?

    • Refine the model to include data from sources other than the third-party software, because the software company has a financial interest in the software being used.

    • Refine the model to include student data from other schools.

    • Refine the model to include information about the community, such as redistricting.

Privacy Concerns

  • Privacy concerns arise through the mass collection of data. The data may contain personal information, which can affect choices about how the data are stored and transmitted.

  • Anything done online is likely to lead to sharing of private data. Using Gmail to order a pair of shoes from Clarks could result in ads for shoes showing up in your search engine.

  • Geolocation, when used within a program, helps you find the approximate geographic location of an IP address along with some other useful information, including ISP, time zone, area code, state, and so on.

  • The high volume of e-commerce makes it difficult to determine if you are dealing with a legitimate site or an illegal phishing site. Identity theft has become common and more significant. The trade-off for the convenience of online shopping is the risk of violating privacy.

Metadata

  • Metadata are data that describe your data—for example, a picture of you standing in front of a waterfall is data. The location and time the picture was taken are metadata.

  • Metadata are used for finding, organizing, and managing information.

  • Metadata can increase the effective use of data or data sets by providing additional information about various aspects of that data.

  • Changes and deletions made to metadata do not change the primary data.

  • Data: A dog (Novack the 3rd) playing in the snow (Aspen)

  • A digital photo album contains metadata for each photo. The metadata are intended to help a search feature locate the popularity of geographic locations. All of the following are metadata and can be determined for the photo above:

    • The picture’s filename

    • The location where the picture was taken

    • The date the picture was taken

    • The author of the picture

  • Changes and deletions made to metadata do not change the primary data. Putting an image of Niagara Falls behind Novack the 3rd in the photo on page 161 will not change the location metadata of where the picture was taken.

  • By using metadata, pictures can be sorted based on where the picture was taken or sorted by the day the picture was taken.

  • Metadata can be used to increase the effective use of data or data sets by providing this additional information.
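
The relationship between data and metadata can be sketched with a small photo album. All filenames, dates, and byte values below are hypothetical placeholders. Each photo keeps its primary data (the pixels) separate from the metadata that describe it.

```python
photos = [
    {"pixels": b"\x10\x20",
     "metadata": {"filename": "a.jpg", "location": "Aspen", "date": "2023-02-01"}},
    {"pixels": b"\x30\x40",
     "metadata": {"filename": "b.jpg", "location": "Niagara Falls", "date": "2022-07-04"}},
    {"pixels": b"\x50\x60",
     "metadata": {"filename": "c.jpg", "location": "Aspen", "date": "2021-12-25"}},
]

# Metadata make the album sortable by date or by location.
by_date = sorted(photos, key=lambda p: p["metadata"]["date"])
by_location = sorted(photos, key=lambda p: p["metadata"]["location"])
date_order = [p["metadata"]["filename"] for p in by_date]
location_order = [p["metadata"]["filename"] for p in by_location]

# Changing or deleting metadata leaves the primary data untouched.
pixels_before = photos[0]["pixels"]
photos[0]["metadata"]["location"] = "Denver"
del photos[0]["metadata"]["date"]
pixels_unchanged = (photos[0]["pixels"] == pixels_before)

print(date_order, location_order, pixels_unchanged)
```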