Chapter 8: Data

Binary is the number system used in computer science.

Computers read machine code, which at the lowest level is made up of 0s and 1s.

This is the binary number system.

A “0” represents no electrical charge, and a “1” represents a charge.

“Current” and “no current” are easy conditions to detect.

These binary digits are referred to as bits.

Eight bits make up a byte.

Abstraction is a concept that is a little hard for many students to grasp.

The basic idea is that we remove details to make something more general.

This makes it easier to use the process for multiple purposes rather than for one specific purpose.

One benefit is that we do not have to know what goes on behind the scenes, meaning how something works.

All number systems use the same principles.

Our base 10 decimal system uses 10 digits, 0 through 9.

Each new column represents the next power of 10.

10^3 | 10^2 | 10^1 | 10^0 |
---|---|---|---|
Thousands | Hundreds | Tens | Ones |
1000 | 100 | 10 | 1 |

When you need to represent the number 10, you have to carry over to a new column, the “tens” column, and use two digits.

Your number represents how many “tens” you have and how many “ones” you have.

When you need to represent the number “100,” you have to carry over to a third column, the “hundreds” column. (The concept of “carrying over” is now often termed “regrouping.”)
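The column idea can be checked with a few lines of Python. This is only a minimal sketch: the function name `place_values` and its `base` parameter are made up for illustration and do not come from the chapter.

```python
def place_values(number, base=10):
    """Return (digit, column value) pairs, left-most column first."""
    if number == 0:
        return [(0, 1)]
    pairs = []
    column = 1                     # base**0, the right-most column
    while number > 0:
        digit = number % base      # how many of this column we have
        pairs.append((digit, column))
        number //= base            # move one column to the left
        column *= base
    return list(reversed(pairs))


print(place_values(100))          # [(1, 100), (0, 10), (0, 1)]
print(place_values(21, base=2))   # [(1, 16), (0, 8), (1, 4), (0, 2), (1, 1)]
```

The second call uses the same principle with powers of 2 instead of powers of 10.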

To convert a decimal number to binary:

1. Write down the decimal number.

2. Subtract the largest power of 2 in the binary table that is less than or equal to your number. (The result of the subtraction cannot be negative.)

3. Mark a 1 in the column on the table for the power of 2 you subtracted.

4. Mark a 0 in the columns that could not be subtracted and were skipped.

5. Repeat steps 2 to 4 until your decimal value reaches zero.

6. Use leading 0s on the left to fill out a byte (8 bits).
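The steps above can be sketched in Python. This is one possible implementation, not the only one. The name `decimal_to_binary` and the `bits` parameter are assumptions introduced for the example.

```python
def decimal_to_binary(value, bits=8):
    """Convert a non-negative integer to a binary string by repeated subtraction."""
    if value < 0 or value >= 2 ** bits:
        raise ValueError(f"{value} does not fit in {bits} bits")
    digits = []
    for power in range(bits - 1, -1, -1):   # 2^7 down to 2^0 for a byte
        column = 2 ** power
        if column <= value:       # this power of 2 can be subtracted
            digits.append("1")
            value -= column
        else:                     # this column is skipped
            digits.append("0")
    return "".join(digits)


print(decimal_to_binary(21))   # 00010101
```

Walking the table from left to right fills in the leading 0s automatically, so step 6 needs no separate code.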

**Example 1: Convert 21 to binary.**

2^7 | 2^6 | 2^5 | 2^4 | 2^3 | 2^2 | 2^1 | 2^0 |
---|---|---|---|---|---|---|---|
128 | 64 | 32 | 16 | 8 | 4 | 2 | 1 |

Starting from the left-most column, the first number you can subtract without having a negative result is 16.

Place a 0 in each column to the left of 2^4 and a 1 in the column for 2^4.

*21 - 16 = 5*

Take the number remaining after subtracting and find the next number in the table that can be subtracted without resulting in a negative number.

You cannot subtract 8, so place a 0 in the 2^3 column.

You can subtract 4, so place a 1 in the 2^2 column.

*5 - 4 = 1*

The result is now 1, so place a 0 in the 2^1 column and a 1 in the 2^0 column.

Answer: 21₁₀ = 00010101₂

The subscript 10 means base 10 (decimal) and the subscript 2 means base 2 (binary).

2^7 | 2^6 | 2^5 | 2^4 | 2^3 | 2^2 | 2^1 | 2^0 |
---|---|---|---|---|---|---|---|
128 | 64 | 32 | 16 | 8 | 4 | 2 | 1 |
0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |

To convert a binary number to decimal:

1. Write the binary table as we did in the example above, with each bit of the binary number in the appropriate column.

2. For each column that has a 1 in it, add the value of that column's power of 2. (You are multiplying 0 or 1 times the value of the power of 2 in each column.)

3. The total of all columns with a 1 in them is the equivalent decimal value.

**Example 1: Convert 00011011 to decimal.**

2^7 | 2^6 | 2^5 | 2^4 | 2^3 | 2^2 | 2^1 | 2^0 |
---|---|---|---|---|---|---|---|
128 | 64 | 32 | 16 | 8 | 4 | 2 | 1 |
0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 |
0 * 128 + | 0 * 64 + | 0 * 32 + | 1 * 16 + | 1 * 8 + | 0 * 4 + | 1 * 2 + | 1 * 1 = |

*16 + 8 + 2 + 1 = 27*
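The same addition can be sketched in Python. The function name `binary_to_decimal` is an assumption made for this example. It multiplies each bit by its column value and adds the results.

```python
def binary_to_decimal(bits):
    """Add up the power of 2 for every column that holds a 1."""
    total = 0
    # reversed() starts at the right-most column, which is 2^0
    for position, bit in enumerate(reversed(bits)):
        if bit not in "01":
            raise ValueError(f"not a binary digit: {bit!r}")
        total += int(bit) * 2 ** position
    return total


print(binary_to_decimal("00011011"))  # 27
print(binary_to_decimal("00010101"))  # 21
```

Python's built-in `int("00011011", 2)` gives the same result and is what you would normally use in practice.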

In addition to the numbers we just reviewed, binary numbers can also represent letters for text fields.

Each letter has a binary value mapped to it.

The software for the particular application knows when it sees a binary number whether it is looking for a number or a letter and interprets it accordingly.
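As a small illustration, Python's built-in `ord` and `chr` expose the number mapped to each character. The chapter does not name a particular mapping. This sketch assumes the Unicode/ASCII codes that Python uses, and the helper name `letter_to_bits` is made up for the example.

```python
def letter_to_bits(letter):
    """Return the 8-bit pattern of the character's code number."""
    return format(ord(letter), "08b")


for letter in "Hi":
    print(letter, ord(letter), letter_to_bits(letter))
# H 72 01001000
# i 105 01101001

# Going the other way: read a bit pattern as a letter.
print(chr(int("01001000", 2)))  # H
```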

Binary numbers are also used to represent colors.

The human eye primarily detects red, green, and blue.

Other colors are a combination of these three colors in different amounts.

Computer monitors work the same way and add differing amounts of red, green, and blue to create the colors that are displayed.

**Color Chart Examples**

Color Name | RGB |
---|---|
Blue | (0,0,255) |
Silver | (192,192,192) |
Purple | (128,0,128) |
Ruby | (224,17,95) |
Emerald | (80,200,125) |
Black | (0,0,0) |
White | (255,255,255) |
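The values in the chart run from 0 to 255, which is the range of one byte. Assuming each of the three amounts is stored in its own byte (a common convention, though the chapter does not state it), a color becomes a 24-bit pattern. The function name `rgb_to_bits` is hypothetical.

```python
def rgb_to_bits(red, green, blue):
    """Return the 24-bit pattern for a color, one byte per channel."""
    for channel in (red, green, blue):
        if not 0 <= channel <= 255:
            raise ValueError("each channel must be between 0 and 255")
    return " ".join(format(channel, "08b") for channel in (red, green, blue))


print(rgb_to_bits(128, 0, 128))  # purple: 10000000 00000000 10000000
print(rgb_to_bits(0, 0, 255))    # blue:   00000000 00000000 11111111
```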

The software program takes in the binary value and interprets it as a color, text value, or number, based on what the program is expecting.

If we simply had a binary number: 00101001, we would not know what it represented.

It could be a number, text, color, instruction, or other representation.

However, the software using it knows what it is and how to interpret it.
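To make this concrete, the sketch below reads the pattern 00101001 in three different ways. The three interpretations are illustrative choices, not something the pattern itself dictates. The gray-level reading in particular is a hypothetical use.

```python
pattern = "00101001"
value = int(pattern, 2)

as_number = value                 # read as an unsigned integer
as_text = chr(value)              # read as a character code
as_gray = (value, value, value)   # read as the same amount of red, green, and blue

print(as_number)  # 41
print(as_text)    # )
print(as_gray)    # (41, 41, 41)
```

The bits never change. Only the program's expectation about what they mean does.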

Overflow errors occur when an integer needs more bits than have been set aside to store it.

In many programming languages, a fixed number of bits is assigned to hold each integer.

When the limit is reached, an overflow error occurs.
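Python's own integers grow as needed, so they do not overflow in this way. The sketch below therefore simulates a fixed-width integer. The function `add_unsigned` and its 8-bit default are assumptions made for illustration. Real languages differ in whether they raise an error or silently wrap around.

```python
def add_unsigned(a, b, bits=8):
    """Add two unsigned integers, raising an error if the sum needs more than `bits` bits."""
    limit = 2 ** bits - 1          # 255 when bits is 8
    total = a + b
    if total > limit:
        raise OverflowError(f"{total} needs more than {bits} bits")
    return total


print(add_unsigned(200, 55))       # 255, the largest 8-bit value
try:
    add_unsigned(200, 56)          # 256 does not fit in 8 bits
except OverflowError as error:
    print("overflow:", error)
```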

Rounding errors occur because of the way numbers with decimal points are stored in the computer.

These numbers are stored in a fixed number of bits, so many fractional values can only be approximated rather than stored exactly.

This imprecise nature can cause rounding errors and possibly inaccurate results in your programs.
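A classic illustration in Python: 0.1 and 0.2 cannot be stored exactly in binary, so their sum is not exactly 0.3. Comparing with a tolerance, here through the standard library's `math.isclose`, is one common way to work around this.

```python
import math

total = 0.1 + 0.2
print(total)                      # 0.30000000000000004
print(total == 0.3)               # False
print(math.isclose(total, 0.3))   # True
```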

Analog data is a continuous stream of data values.

Think of the sound a train whistle makes as it nears and then leaves a crossing.

It changes pitch (higher and lower tones) and gets louder and softer as it nears and leaves the crossing.

The sound can be represented by a continuous wave.

Analog data can represent anything, including colors.

We can reduce the amount of space needed to represent the image or other file through data compression:

Lossless compression techniques allow the original file to be restored exactly.

No data is lost, but the file cannot be compressed as much as with lossy techniques.

Lossy compression techniques lose some data in the compression process.

The original cannot be fully restored, but the compression is greater than with lossless techniques.
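The difference can be sketched in Python. The lossless half uses the standard library's `zlib` module. The lossy half is a deliberately simple toy (dropping the low 4 bits of each value), chosen only for illustration and not how real image or audio formats work. The sample values are made up.

```python
import zlib

original = b"AAAAAAAAAABBBBBBBBBB" * 50   # 1000 bytes of highly repetitive data

# Lossless: every byte can be rebuilt from the compressed form.
packed = zlib.compress(original)
restored = zlib.decompress(packed)
print(len(original), len(packed), restored == original)

# Lossy (toy example): keep only the top 4 bits of each 8-bit sample.
samples = [200, 201, 203, 198, 17, 18, 16, 19]
reduced = [s >> 4 for s in samples]    # 4 bits per sample instead of 8
approx = [r << 4 for r in reduced]     # the best reconstruction available
print(approx)                          # [192, 192, 192, 192, 16, 16, 16, 16]
```

The lossy version needs half the bits, but the small differences between neighboring samples are gone for good.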

Computers enable us to process data to turn it into information for decision making and research.

Data collected from all types of events—such as visits, searches, inquiries, orders, returns, temperatures, scores, attendees, acres planted, acres harvested, fish, birds, photos, videos, and audio files—are considered to be raw data.

**Cleaning:** Computers can "clean" data by removing corrupt data, repairing incomplete data, and verifying ranges or dates. Removing or flagging invalid data is helpful.

Data cleaning can also change "dr." to "Drive" for consistency.

Even after cleaning, some data errors can be missed, resulting in incorrect processing results.
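A minimal cleaning sketch in Python follows. The records, field names, and the valid temperature range of 90 to 110 are all invented for the example. A real dataset would need its own rules.

```python
raw_rows = [
    {"street": "12 Oak dr.", "temp_f": "98.6"},
    {"street": "7 Elm Drive", "temp_f": ""},        # incomplete value
    {"street": "3 Pine dr.", "temp_f": "9860"},     # out of range
]


def clean(row, low=90.0, high=110.0):
    """Return a cleaned copy of the row with a 'valid' flag added."""
    cleaned = dict(row)
    street = cleaned["street"].strip()
    if street.lower().endswith(" dr."):
        street = street[:-4] + " Drive"     # make the abbreviation consistent
    cleaned["street"] = street
    try:
        temp = float(cleaned["temp_f"])
        cleaned["valid"] = low <= temp <= high   # verify the range
    except ValueError:
        cleaned["valid"] = False                 # flag unreadable values
    return cleaned


cleaned_rows = [clean(row) for row in raw_rows]
for row in cleaned_rows:
    print(row)
```

Invalid rows are flagged rather than deleted here, so a person can still review them later.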

**Filtering:** Computers filter data easily. Filtering helps with interpretation by identifying and extracting subsets of the data.

For instance, all temperature values greater than 98.6 may be meaningful and require further processing, or a count of those values may be useful.
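The temperature example might look like this in Python. The readings are made-up values used only to show the filter.

```python
temperatures = [97.9, 98.6, 99.4, 101.2, 98.1, 100.0]

# Keep only the readings above the threshold.
above_normal = [t for t in temperatures if t > 98.6]

print(above_normal)       # [99.4, 101.2, 100.0]
print(len(above_normal))  # 3
```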

**Classifying:** Grouping data with similar features and values helps computers make sense of large datasets. The people working with the data provide these groupings or classifications.

Groupings may use one or more criteria.

The purpose of the data collection determines which criteria are used.
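A small grouping sketch follows. The names, readings, and the cutoffs of 98.6 and 100.4 are hypothetical choices standing in for whatever criteria the people working with the data would supply.

```python
from collections import defaultdict

readings = [("Ana", 97.9), ("Ben", 99.4), ("Cy", 101.2), ("Di", 98.6)]


def classify(temp_f):
    """Assign a reading to a group using two illustrative cutoffs."""
    if temp_f > 100.4:
        return "high"
    if temp_f > 98.6:
        return "elevated"
    return "normal"


groups = defaultdict(list)
for name, temp in readings:
    groups[classify(temp)].append(name)

print(dict(groups))
# {'normal': ['Ana', 'Di'], 'elevated': ['Ben'], 'high': ['Cy']}
```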

**Bias:** This can unintentionally be present in data. It occurs when the data collected does not represent all the possibilities in the pool of available options.

**Patterns:** The data analysis starts with a hypothesis or question to check. The data is then processed using these criteria to see if patterns emerge.

Computers can identify patterns that people either cannot recognize or cannot see because they are unable to process enough data.

This process is known as “data mining.”

Machine learning is related to data mining, but it uses the data to make predictions.

A correlation may not mean one thing caused the other.

Always do further research with data from additional sources, not just more data from the same source, to prove the relationship exists.

Scalability is the ability to increase the capacity of a resource without having to go to a completely new solution, and for that resource to continue to operate at acceptable levels when the increased capacity is being added.

This means you do not have to bring the entire system down to add new resources.

Everything already in place keeps operating normally, and new resources are added without impacting routine processing.

The increase should be transparent to anyone using the resource.

Scalability is important for storing and processing large datasets.

Such datasets often cannot fit on our computers or on most organizations’ servers.

Parallel computing systems may be needed to process these large datasets.

Metadata is data that describes data and can help others find the data and use it more effectively.

It is not the content of the data, but includes information about the data such as:

Date

Time stamp

Author/owner

File size

File type

Changing, adding, or deleting metadata does not impact or change the actual data in any way.

It allows us to organize and add structure to our data in addition to making it easier to find.

Metadata includes “tags” that are used to identify the content.
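As a rough sketch, data and metadata can be kept side by side in Python dictionaries. The records, tag names, and the helper `find_by_tag` are invented for illustration. Real file systems and databases store metadata in their own ways.

```python
files = [
    {"data": "acres planted: 120",
     "metadata": {"file_type": "txt", "tags": ["farm", "planting"]}},
    {"data": "acres harvested: 95",
     "metadata": {"file_type": "txt", "tags": ["farm", "harvest"]}},
]


def find_by_tag(records, tag):
    """Return the data of every record whose metadata carries the tag."""
    return [r["data"] for r in records if tag in r["metadata"]["tags"]]


before = [r["data"] for r in files]
files[0]["metadata"]["tags"].append("spring")   # change the metadata only
after = [r["data"] for r in files]

print(before == after)                 # True: the data itself is untouched
print(find_by_tag(files, "spring"))    # ['acres planted: 120']
print(find_by_tag(files, "farm"))      # ['acres planted: 120', 'acres harvested: 95']
```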

Software tools such as spreadsheets and databases can be used to filter, organize, and search the data.

Search tools and filtering systems are needed to help analyze the data and recognize patterns.

After data is cleaned and organized, it must be reported and displayed in a user-friendly manner.

Many tools help communicate data insights.

Charts, tables, and other graphics help summarize data visually.
