Data Processing and Information

Data and information

Data is the plural of datum: raw numbers, letters, symbols, sounds, or images with no meaning attached.

Once data is given meaning and context, it becomes information. People often confuse the two: data is simply information in its raw form, with no context or meaning.

Here are some examples:

Sets of data and information:

110053, 641609, 160012, 390072, 382397, 141186

If we are told that 110053, 641609, 160012, 390072, 382397, and 141186 are all postal codes in India (a context), the first set of data becomes information as it now has meaning.

01432 01223 01955 01384 01253 01284 01905 01227 01832 01902 01981 01926 01597

Similarly, if you are informed that 01432, 01223, 01955, 01384, 01253, 01284, 01905, 01227, 01832, 01902, 01981, 01926, and 01597 are telephone area dialling codes in the UK, they can now be read in context and we can understand them, as they now have a meaning.

So data processing is simply turning raw data into information by giving it meaning and context.

Data on a computer is stored as binary digits (bits) in the form of ones and zeros. It can be stored on various media such as hard disk drives, solid-state drives, DVDs, SD cards, memory sticks, or in RAM. Data is processed to produce new, meaningful information, often for analysis. This processing involves different operations, such as opening source files (e.g., .csv files) in a spreadsheet and adding formulae. Data is input, stored, processed, and output as usable information.

Data collected from direct data sources is collected for a specific purpose and used for that purpose only. It is often referred to as ‘original source data’. Examples of sources of direct data are questionnaires, interviews, observation, and data logging.

Advantages:

  • Data is more reliable as the source is trusted.

  • The person collecting the data can use methods to gather specific data even if the required data is obscure, whereas with indirect data sources this type of data may never have been collected before.

  • Unnecessary data is eliminated.

  • Data is most likely up to date.

  • Data is collected in the required format.

Disadvantages:

  • Collecting data is time-consuming.

  • It is expensive, as data loggers and computers may be needed.

  • The sample size may be too small to be used for statistical purposes because of cost and time restrictions.

For example, an online store collects details of its customers’ purchases. The store can use them to keep statistics on product sales and restocking. Because the store collected the data itself, for its own purpose, this is direct data.

Sources of direct data, for example:

  1. questionnaires

  2. interviews

  3. data logging

  4. observation

Indirect data is data obtained from a third party and used for a different purpose than what it was collected for. Examples of indirect data sources are the electoral register and businesses collecting personal information for use by other organizations (third parties).

Advantages:

  • Data is immediately available.

  • A large sample is likely to have been used, which means the data can be used for statistical purposes.

  • The cost is lower, as no equipment is bought, only the data.

  • The data is often purchased from a specialist company, so it is likely to be accurate.

Disadvantages:

  • Unnecessary data may be present.

  • Data may be out of date.

  • The source may not be so trustworthy.

  • Data will have to be edited to the required format.

  • Data may be biased, as the sample may be local while the use may be national.

The online store could also, with customers’ consent, sell customer details such as email addresses to another company. The data obtained by that third-party company is known as indirect data.

Sources of indirect data, for example:

  1. weather data

  2. census data

  3. electoral register

  4. businesses collecting personal information when used by third parties

  5. research from textbooks, journals and websites

The quality of information can be subjective, depending on the user’s perspective, but it can also be objectively assessed based on certain factors. Poor quality data can lead to serious consequences, such as distorted business decisions, poor customer service, and a damaged reputation. For example, a UK hospital was temporarily closed due to incorrect death rate data, and in the USA, incorrectly addressed mail costs the postal service significant time and money. Accurate data is crucial for businesses to understand their performance and identify future opportunities.

Some of the factors that affect the quality of information are:

Accuracy

To be accurate, information must be free from errors and mistakes, which often depends on the accuracy of the collected data. Mistakes can occur during data collection, such as transposing numbers in a stock check. Verification and validation methods can help check data accuracy. Unambiguous questions are crucial in direct data sources such as questionnaires, to avoid misleading responses. Multiple-choice questions can help quantify responses. Inaccuracies can also arise from non-representative samples, data entry errors, or improperly calibrated sensors. Proper setup of computer systems is essential for accurate data interpretation.

Relevance

When judging the quality of information, relevance is crucial. Data collected should be pertinent to its intended purpose and meet the user’s needs. Collecting irrelevant data wastes time and money. Data may be too detailed, too general, or geographically irrelevant, so clear information needs and search strategies are essential. The first step is to select relevant sources and rule out biased ones; in academic studies, for instance, academic sources are preferred. The next step is to read thoroughly and select exactly the information that is required. For example, in a school setting, teachers should focus on relevant material to help students pass their exams, rather than on interesting but irrelevant topics.

Open-ended questions allow unlimited responses, while closed-ended questions allow only a limited set of responses.

Age

The age of information significantly impacts its quality. The information must be accurate, relevant, and up-to-date. Over time, information can become outdated, leading to inaccurate results. For instance, personal information in a database that hasn’t been updated to reflect changes like marriage or having children can lead to incorrect assessments, such as in loan applications or targeted advertising. Outdated information can result in poor decisions, wasting time and money, and ultimately affecting profits.

Level of Detail

For information to be useful, it needs the right amount of detail. Too much detail can make it difficult to extract the necessary information, while too little detail may not provide a comprehensive view of the problem. Information should be concise enough for easy examination and use, without extraneous details. For example, a car company director would benefit from a graph showing monthly sales figures rather than a detailed daily report for each model over the past year. Understanding the user’s needs is crucial to providing the correct level of detail.

Completeness of the Information

High-quality information must be complete, addressing all relevant parts of a problem. Incomplete information creates gaps, making it difficult to solve problems or make informed decisions. Collecting additional data to fill these gaps can be time-consuming. For example, a car company director needs a full year’s sales figures for all models, not just the first six months or only the best-selling models. Ensuring completeness is as important as accuracy when inputting data into a database.

1.3 Encryption

1.3.1 The Need for Encryption

When personal information, such as credit card details or personal data, is sent over the internet, there is a risk of interception. Intercepted information can be altered or used for identity theft, cyber fraud, or ransom. Company secrets can be sold to rivals. However, if intercepted information is unreadable, it becomes useless to hackers. Despite vigilant security measures, hackers can still breach systems, but encryption makes the data indecipherable, rendering hacking efforts futile. Encryption keeps personal data private and secure, preventing hackers from understanding communications and protecting online banking and shopping. Encryption is when data is scrambled into a code, with the resulting symbols appearing jumbled up so that they cannot be understood.

Encryption converts data into a code that only authorized individuals can understand. This process applies to data transmission and storage, converting plaintext to ciphertext. While it doesn’t stop cybercriminals from intercepting data, it prevents them from understanding it. Both personal and business data are vulnerable to hacking, and encryption helps protect confidential information and maintain client trust. Encryption should be applied to computers, hard drives, pen drives, and portable devices like laptops, tablets, and smartphones to prevent data misuse if the device is hacked, lost, or stolen.

Encryption works by using an encryption key to encode the data on the sending computer. The receiving computer uses a corresponding decryption key to translate it back. A key is a collection of bits, often randomly generated, and the longer the key, the more effective the encryption. For example, 128-bit keys offer 2^128 combinations, making it virtually impossible to crack. Modern encryption often uses 256-bit keys, which are even more secure. The key, combined with an algorithm, creates the ciphertext.
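To get a feel for why key length matters, the short sketch below counts how many different keys exist for a few key lengths. The function name `key_space` is made up for this illustration.

```python
def key_space(bits: int) -> int:
    """Number of distinct keys that can be made from the given number of bits."""
    return 2 ** bits

# Each extra bit doubles the number of keys an attacker would have to try.
for bits in (8, 128, 256):
    print(f"{bits}-bit key: {key_space(bits):,} possible keys")
```

An 8-bit key has only 256 possibilities, which a computer could try instantly; a 128-bit key has a 39-digit number of possibilities.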

1.3.2 Methods of Encryption

Encryption involves converting data into a code by scrambling it, resulting in jumbled symbols. The algorithms used for this process are highly complex, making it extremely difficult for even the most dedicated hackers to decipher the data. Encrypted data is known as ciphertext, while unencrypted data is called plaintext.

There are two main types of encryption:

  1. Symmetric Encryption: Uses the same key for both encryption and decryption.

  2. Asymmetric Encryption (Public-Key Encryption): Uses a pair of keys—one for encryption (public key) and one for decryption (private key).

This ensures that data remains secure and unreadable to unauthorized individuals, protecting personal and business information from cyber threats.

Symmetric Encryption

Symmetric encryption, also known as ‘secret key encryption,’ involves both the sender and the receiver using the same key to encrypt and decrypt a message. This method is faster than asymmetric encryption but poses a security risk because the encryption key must be shared with the recipient. If the key is intercepted, the message can be decrypted by anyone. To mitigate this risk, many companies use asymmetric encryption to send the secret key and then use symmetric encryption to encrypt the data. In symmetric encryption, both parties share the same private key, which scrambles and unscrambles the data.
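The idea of one shared key doing both jobs can be shown with a toy example. The sketch below uses a simple XOR operation with a random key as long as the message. It is only an illustration of the symmetric principle, not how real products encrypt data (real systems use vetted algorithms such as AES), and the function name `xor_bytes` is an assumption made for this example.

```python
import secrets

def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each byte of data with the matching byte of key."""
    return bytes(d ^ k for d, k in zip(data, key))

plaintext = b"Meet at noon"
# The shared secret key: random bytes, same length as the message.
key = secrets.token_bytes(len(plaintext))

ciphertext = xor_bytes(plaintext, key)   # sender scrambles with the key
recovered = xor_bytes(ciphertext, key)   # receiver unscrambles with the SAME key

print(recovered == plaintext)
```

Anyone who intercepts the key can run the same function on the ciphertext, which is exactly the weakness of symmetric encryption described above.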

Asymmetric Encryption

Asymmetric encryption, or ‘public-key encryption,’ uses two different keys: one public and one private. The public key, which is widely distributed, is used to encrypt data, while the private key, known only to the recipient, is used to decrypt it. This method allows secure transmission over public channels like the internet because the public key cannot be used to decrypt the message it encrypted. It is nearly impossible to derive the private key from the public key and the encrypted message. Asymmetric encryption is commonly used for sending secure emails and digitally signing documents. Asymmetric encryption uses more processing power than symmetric encryption, so it is slower.

1.3.3 Encryption Protocols

An encryption protocol is a set of rules that dictate how algorithms should be used to secure information. Several encryption protocols exist:

  • IPsec (Internet Protocol Security): This protocol suite allows the authentication of computers and encryption of data packets to provide secure communication between two computers over a public network. It is commonly used in VPNs (Virtual Private Networks).

  • SSH (Secure Shell): This protocol enables secure remote login to a computer network. SSH is often used for logging in and performing operations on remote computers, as well as for transferring data between computers.

  • TLS (Transport Layer Security): The most popular protocol for securely accessing web pages. TLS is an improved version of the SSL (Secure Sockets Layer) protocol, and the term SSL/TLS is often used to refer to both.

The Purpose of SSL/TLS

Since TLS is a development of SSL, the terms are sometimes used interchangeably. The main purposes of SSL/TLS are to:

  • Enable encryption to protect data.

  • Ensure the authenticity of the entities exchanging data using a digital certificate.

  • Maintain data integrity to prevent corruption or alteration.

Additional purposes include:

  • Ensuring websites meet PCI DSS (Payment Card Industry Data Security Standard) rules for secure bank card payment processing.

  • Improving customer trust by demonstrating that a company uses SSL/TLS to protect its website.

Many websites use SSL/TLS to encrypt data during transfer, protecting it from attackers. SSL/TLS should be used when storing or sending sensitive data online, such as during tax returns, online shopping, or insurance renewals. Websites with an HTTPS address use SSL/TLS, which verifies the server’s identity using digital certificates. These certificates contain information like the domain name, the issuing certificate authority (CA), and the public key. Although SSL was replaced by TLS, these certificates are still referred to as SSL certificates. Valid SSL certificates can only be obtained from a CA, which conducts checks on applicants to ensure they receive a unique certificate.

The Use of SSL/TLS in Client–Server Communication

Transport Layer Security (TLS) is essential for applications requiring secure data exchange over a client-server network, such as web browsing sessions and file transfers. Similar to IPsec, TLS can also enable VPN connections and Voice over IP (VoIP). To establish an SSL/TLS connection, a client (e.g., a web browser) needs to obtain the server’s public key, found in the server’s digital certificate. This certificate proves the server’s authenticity.

When a browser wants to access a secured website, an SSL/TLS handshake occurs. This involves the client and server exchanging messages to agree on communication rules. The client sends its SSL/TLS version and a list of supported cipher suites (encryption types). The server responds with its chosen cipher suite and its SSL certificate. The client verifies the certificate’s validity and the server’s legitimacy. The client then sends an encrypted random string of bits, which both sides use to calculate the shared session key that encrypts the rest of the session. The client completes its part of the handshake by sending an encrypted message to the server.

1.3.4 Uses of Encryption

Hard-Disk Encryption

Hard-disk encryption automatically encrypts files when they are written to the disk and decrypts them when read, leaving all other data on the disk encrypted. This process is understood by common application software like spreadsheets, databases, and word processors. Full disk encryption protects data even if the disk is stolen or left unattended, as only the keyholder can access its contents. However, if an encrypted disk crashes or the OS becomes corrupted, data recovery can be problematic. It is crucial to store encryption keys safely, as no one can access the data without the key. Booting up the computer can also be slower with full disk encryption.

Email Encryption

Encrypting emails ensures that their content can only be read by the intended recipient. While many people rely on passwords to protect their email accounts, emails are still susceptible to interception. Unencrypted emails can expose sensitive information to hackers. In the early days of email communication, most messages were sent in plain text, making them easily accessible to unauthorized individuals. Encrypting emails adds an essential layer of security to protect personal and sensitive information.

There are three parts to email encryption.

  1. The first is to encrypt the actual connection from the email provider, because this prevents hackers from intercepting and acquiring login details and reading any messages sent (or received) as they leave (or arrive at) the email provider’s server.

  2. Then, messages should be encrypted before sending them so that even if a hacker intercepts the message, they will not be able to understand it. They could still delete it on interception, but this is unlikely.

  3. Finally, since hackers could bypass your computer’s security settings, it is important to encrypt all your saved or archived messages.

Email encryption uses asymmetric encryption. The email sender uses the public key to encrypt the message and the receiver uses the private key to decrypt the message.

Email encryption does not encrypt your email address, only your message, so you cannot use it to send emails anonymously.

HTTPS

HTTPS (Hypertext Transfer Protocol Secure) extends HTTP and uses SSL/TLS protocols to encrypt data transferred between a web browser and a web server. This ensures that any data exchanged, such as login credentials, personal information, and payment details, is secure from eavesdroppers and hackers. Websites using HTTPS display a padlock icon in the browser’s address bar, indicating a secure connection. HTTPS is essential for protecting sensitive transactions, such as online banking, shopping, and any activity requiring the exchange of personal data. It also helps in verifying the authenticity of a website, ensuring users are communicating with the intended server and not a malicious site.

Advantages of Encryption:

  1. Data Protection: Encrypting personal information, such as credit card details, prevents identity theft, cyber-fraud, and ransomware attacks.

  2. Security for Company Secrets: Protects sensitive company information from being sold to competitors.

  3. Integrity: Ensures that data cannot be altered during transmission.

Disadvantages of Encryption:

  1. Performance Impact: Encrypting data increases loading times and requires additional processing power.

  2. Resource Intensive: Uses more memory and computational power, especially with larger key sizes.

  3. Ransomware Risk: Hackers can encrypt data and demand a ransom for the decryption key.

  4. Key Management Issues: Losing the private key can result in permanent data loss. Reissuing digital certificates can be time-consuming.

  5. User Carelessness: Decrypted data left unprotected can be vulnerable to attacks.

Symmetric vs. Asymmetric Encryption:

  • Symmetric Encryption: Faster and suitable for large amounts of data but requires secure key exchange.

  • Asymmetric Encryption: More secure for key exchange but slower and computationally intensive.

SSL/TLS vs. IPsec for VPNs:

  • SSL/TLS:

    • Advantages: Easier management of digital certificates, no need for client software, and simpler setup.

    • Disadvantages: Weaker security due to optional client authentication, and extra software may be needed for non-web applications.

  • IPsec:

    • Advantages: Stronger security with mandatory client and server authentication, supported by more operating systems.

    • Disadvantages: More complex management, higher costs for client software, and time-consuming certificate management.

Ensuring the accuracy of Data

Ensuring data accuracy is crucial for producing reliable results during data processing. Data entry, often the most time-consuming part of this process, must be performed with minimal errors to avoid the need for extensive corrections or re-entry. To achieve this, two methods are employed: validation and verification. Validation ensures that the data values entered are reasonable and sensible, though it does not guarantee correctness. For instance, a validation check might prevent utility bills from exceeding a certain amount, but it wouldn’t catch a bill of $321 instead of $231 if both amounts are considered reasonable. Verification, on the other hand, focuses on the accuracy of the data entry process itself, ensuring that the data entered matches the source.

Various validation methods can be used depending on the type of data being input. For example, a reasonableness check can ensure that data values fall within a sensible range, such as preventing excessively high utility bills. However, not all fields can be easily validated; names, for instance, can vary widely and include special characters that complicate validation. In a school library database, for example, validation routines would be applied to ensure the accuracy of data in tables for books and borrowers. While validation helps ensure data is reasonable, verification ensures the accuracy of the data entry process, both of which are essential for maintaining data integrity.

Presence Check

A presence check ensures that important data is not omitted from certain fields, especially key fields like ISBN in a Books table or Borrower_ID in a Borrowers table. This check is often indicated by a red asterisk on online forms, prompting users to complete mandatory fields before proceeding. However, presence checks do not prevent incorrect or unreasonable data from being entered.
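A presence check can be sketched in a few lines. The example below is a minimal illustration, assuming each record is held as a Python dictionary; the function name `presence_check` and the sample record are invented for this example.

```python
def presence_check(record: dict, required_fields: list) -> list:
    """Return the names of required fields that are missing or left blank."""
    missing = []
    for name in required_fields:
        value = record.get(name)
        if value is None or str(value).strip() == "":
            missing.append(name)
    return missing

book = {"ISBN": "9780306406157", "Title": "   "}
print(presence_check(book, ["ISBN", "Title"]))
```

Note that the check only reports that Title is blank; it has no opinion on whether the ISBN that was entered is the right one.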

Range Check

Range checks are applied to fields containing numeric data to ensure values fall within a specified range. For example, in a Book table, a range check might ensure that the cost of a book is between $10 and $29. This type of check can also be applied to each part of a date, for example rejecting a day above 31 or a month above 12. Note that an impossible date such as 31/2/05 passes these simple limits, so rejecting it requires the upper limit for the day to depend on the month.
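As a minimal sketch, a range check is just a comparison against a lower and an upper boundary. The function name `range_check` is an assumption for this example; the $10 to $29 limits come from the book-cost example.

```python
def range_check(value: float, low: float, high: float) -> bool:
    """True if value lies between low and high, boundaries included."""
    return low <= value <= high

print(range_check(15.99, 10, 29))  # a sensible book cost
print(range_check(35.00, 10, 29))  # above the upper boundary
```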

Type Check

Type checks ensure that data entered into a field is of the correct data type. For instance, the Borrower_ID field in a Borrowers table might be set to accept only numeric characters. However, simply setting the field data type to numeric is not always sufficient, as it might remove leading zeros, which are important in fields like telephone numbers. Therefore, a validation routine might be needed to allow only specific characters, such as digits 0 to 9, ensuring no invalid characters are entered. This type of check is also known as an invalid character check or character check. Despite its usefulness, a type check does not alert users if the correct number of characters has not been entered in a particular field.
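The leading-zero problem and the character check can be illustrated as follows. This is a sketch only; the function name `character_check` is invented for the example.

```python
def character_check(text: str) -> bool:
    """Accept only text made up entirely of the digits 0 to 9."""
    return text != "" and all(ch in "0123456789" for ch in text)

# Storing the value as a number loses the leading zero...
as_number = int("0153")
# ...so the field is kept as text and its characters are checked instead.
print(as_number, character_check("0153"), character_check("01S3"))
```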

Length Check

A length check is performed on alphanumeric fields to ensure they have the correct number of characters, but it is not used with numeric fields. Generally, it is applied to fields with a fixed number of characters, such as telephone numbers in France, which are typically 10 digits long. For instance, in a database, a length check could be applied to the ISBN field in the Books table to ensure all ISBNs are 13 characters long. Similarly, it could be used for the Borrower_ID field in the Borrowers table if these IDs are consistently four characters in length. Length checks can also be set to a range of lengths, such as phone numbers in Ireland, which vary between 8 and 10 characters. This check ensures that the entered data meets the required length, producing an error message if it does not.
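A length check with either a fixed length or a range of lengths might look like the sketch below. The function name `length_check` and its parameters are assumptions made for illustration.

```python
from typing import Optional

def length_check(text: str, min_len: int, max_len: Optional[int] = None) -> bool:
    """True if the number of characters is within the allowed length(s).

    If max_len is omitted, the text must be exactly min_len characters long.
    """
    if max_len is None:
        max_len = min_len
    return min_len <= len(text) <= max_len

print(length_check("9780306406157", 13))   # fixed length, like an ISBN
print(length_check("0123456", 8, 10))      # too short for an 8 to 10 range
```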

Format Check

A format check ensures that data follows a specific pattern. For example, new vehicle registration plates in the UK follow a pattern of two letters, two digits, a space, and three letters (e.g., XX21 YYY). In a database, a format check could be applied to the Class field in the Borrowers table to ensure it follows a pattern of two digits followed by one letter. This check is also useful for validating dates in a specific format, such as Date_of_birth, which might be two digits followed by a slash, two digits, another slash, and two digits. However, format checks do not prevent mistyped entries that still fit the pattern.
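Format checks are commonly written as patterns (regular expressions). The sketch below uses Python's `re` module; the pattern names and the sample values are assumptions chosen to match the three patterns described above.

```python
import re

PLATE_PATTERN = r"[A-Z]{2}[0-9]{2} [A-Z]{3}"        # two letters, two digits, space, three letters
CLASS_PATTERN = r"[0-9]{2}[A-Z]"                    # two digits followed by one letter
DATE_PATTERN = r"[0-9]{2}/[0-9]{2}/[0-9]{2}"        # dd/mm/yy

def format_check(text: str, pattern: str) -> bool:
    """True if the whole of text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

print(format_check("AB21 CDE", PLATE_PATTERN))
print(format_check("9B", CLASS_PATTERN))
print(format_check("99/99/99", DATE_PATTERN))
```

The last call is accepted even though 99/99/99 is not a real date, which shows that a format check only looks at the pattern, not the meaning.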

Limit Check

A limit check is similar to a range check but is applied to only one boundary. For example, in the UK, the legal driving age is 17, with no upper limit. A limit check would ensure that any entered age for a driving license application is not below 17. This type of check is useful for ensuring that data meets a minimum or maximum requirement, generating an error message if it does not.
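A limit check has only one boundary to test. In this small sketch the constant `MIN_DRIVING_AGE` and the function name `limit_check` are invented for the example.

```python
MIN_DRIVING_AGE = 17  # lower boundary only; there is no upper limit

def limit_check(age: int, minimum: int = MIN_DRIVING_AGE) -> bool:
    """True if age is at or above the minimum."""
    return age >= minimum

print(limit_check(16), limit_check(17), limit_check(95))
```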

Check Digit

A check digit is used to validate numerical data, often stored as a string of alphanumeric data. For example, the last digit of an ISBN for a book is a check digit calculated using a specific arithmetic method. Each digit in the first 12 digits of the ISBN is multiplied by 1 if it is in an odd-numbered position or by 3 if it is in an even-numbered position. The resulting numbers are added together and divided by 10. If the remainder is 0, that becomes the check digit; otherwise, the remainder is subtracted from 10 to get the check digit. This digit is then added as the 13th digit of the ISBN. When this data is entered into a database, the computer recalculates the check digit to ensure it matches. If it does not, an error message is produced, indicating a possible data entry error, such as transposing two digits.
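The calculation described above can be followed step by step in code. This is a sketch of the ISBN-13 method as described in these notes; the function names are assumptions, and the sample ISBN 9780306406157 is a commonly quoted example.

```python
def isbn13_check_digit(first12: str) -> int:
    """Calculate the 13th digit from the first 12 digits of an ISBN."""
    total = 0
    for position, ch in enumerate(first12, start=1):
        weight = 1 if position % 2 == 1 else 3   # odd positions x1, even positions x3
        total += int(ch) * weight
    remainder = total % 10
    return 0 if remainder == 0 else 10 - remainder

def is_valid_isbn13(isbn: str) -> bool:
    """Recalculate the check digit and compare it with the one entered."""
    if len(isbn) != 13 or not (isbn.isascii() and isbn.isdigit()):
        return False
    return isbn13_check_digit(isbn[:12]) == int(isbn[12])

print(isbn13_check_digit("978030640615"))   # weighted total is 93, so 10 - 3
print(is_valid_isbn13("9780306406157"))     # entered correctly
print(is_valid_isbn13("9780306406517"))     # two digits transposed
```

For 978030640615 the weighted total is 93, the remainder after dividing by 10 is 3, and 10 - 3 gives a check digit of 7. Swapping the 1 and the 5 changes the total, so the recalculated digit no longer matches.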

Lookup Check

A lookup check compares entered data against a limited set of valid entries. If the data matches one of these entries, it is accepted; otherwise, an error message is produced. This check is efficient when there are a limited number of valid values, such as the days of the week. In a database, a lookup check could be used for the Class field in the Borrowers table, which might only have specific classes like 10A, 10B, 10C, etc. If an invalid class, such as 9B, is entered, the computer will not find a match and will produce an error message.
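A lookup check is a membership test against the list of valid entries. In the sketch below the set `VALID_CLASSES` is an assumed list of classes based on the example above.

```python
VALID_CLASSES = {"10A", "10B", "10C"}  # assumed list of classes in the school

def lookup_check(value: str, valid_values: set) -> bool:
    """True if value is one of the permitted entries."""
    return value in valid_values

print(lookup_check("10B", VALID_CLASSES))
print(lookup_check("9B", VALID_CLASSES))
```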

Consistency Check

A consistency check, also known as an integrity check, ensures that data across two fields is consistent. For example, in a Borrowers table, a consistency check could ensure that the Class field and the Date_of_birth field are consistent. If students in year 11 were born between 1 September 2004 and 31 August 2005, a consistency check would ensure that if the Class field is 11, the Date_of_birth must fall within this range. If not, an error message will be generated. This check is often used to ensure that a person’s age is consistent with their date of birth, although storing age is generally considered bad practice as it changes regularly and requires frequent updates.
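The year 11 example could be sketched as below. It assumes class names begin with the year number (such as 11A) and only defines a rule for year 11; both the function name and those assumptions are for illustration only.

```python
from datetime import date

# Birth-date range for year 11, taken from the example above.
YEAR_11_RANGE = (date(2004, 9, 1), date(2005, 8, 31))

def consistency_check(class_name: str, date_of_birth: date) -> bool:
    """True if the date of birth is consistent with the class."""
    if class_name.startswith("11"):
        earliest, latest = YEAR_11_RANGE
        return earliest <= date_of_birth <= latest
    return True  # no rule defined for other year groups in this sketch

print(consistency_check("11A", date(2005, 1, 15)))
print(consistency_check("11A", date(2003, 5, 1)))
```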

Verification

Verification ensures that data has been entered accurately by a human or transferred correctly from one storage medium to another. There are several methods of verification, including visual checking and double data entry.

Visual Checking

Visual checking involves the person who enters the data visually comparing the entered data with the source document. This can be done by reading the data on the screen or by printing out the data and comparing it side by side with the source document. While this is the simplest verification form, it can be time-consuming and costly. Additionally, if the same person who entered the data is also checking it, they might overlook their mistakes. A more effective approach is to have someone else perform the check.

Double Data Entry

Double data entry involves entering the data twice. The first version is stored, and the second entry is compared to the first by a computer. If there are any differences, the computer alerts the person entering the data, who then checks and corrects the errors if necessary. Alternatively, two different people can enter the data, and the computer compares the two versions, alerting both operators to any discrepancies. This method ensures that data is copied accurately, although it does not verify the correctness of the original data. The key difference between visual verification and double data entry is that in the latter, the computer performs the comparison.
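The comparison step that the computer performs can be sketched as follows. The function name `double_entry_differences` and the sample entries are invented for this example.

```python
from itertools import zip_longest

def double_entry_differences(first: list, second: list) -> list:
    """Return the positions at which the two entered versions disagree."""
    return [i for i, (a, b) in enumerate(zip_longest(first, second)) if a != b]

first_entry = ["4762", "0153", "2539"]
second_entry = ["4762", "0135", "2539"]   # digits transposed in the second item
print(double_entry_differences(first_entry, second_entry))
```

The program can only say that position 1 differs; a person still has to go back to the source document to decide which version is right.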

Parity Check

A parity check is a method used to ensure data has been transmitted accurately between devices. Computers store data in bits, with each string of bits forming a byte, typically consisting of 8 bits. ASCII (American Standard Code for Information Interchange) is commonly used to represent text, assigning numbers to characters. For instance, the ASCII code for uppercase ‘I’ is 73, represented in binary as 01001001. When transmitting data, a parity bit is added to each byte to ensure an even number of 1s, a method known as even parity.

During transmission, the sending device counts the number of 1s in each byte. If the count is even, the parity bit is set to 0; if odd, it is set to 1, ensuring the total number of 1s is even. The receiving device then checks each byte to confirm it has an even number of 1s. If an odd number of 1s is detected, it indicates an error in transmission. For example, the word “BROWN” in ASCII is represented by the bytes 01000010, 01010010, 01001111, 01010111, and 01001110. Adding parity bits results in 010000100, 010100101, 010011111, 010101111, and 010011100, respectively.

While effective, parity checks are not foolproof. If two 1s are transmitted as 0s, the byte will still have an even number of 1s, and the error will go undetected. Similarly, if a 1 and a 0 are transposed, such as 010000100 (B with a parity bit added) being transmitted as 010000010 (A with a parity bit added), the parity check will not report an error as there is still an even number of 1s. More complex error-checking methods have had to be developed, but parity checking is still very common because it is such a simple method for detecting errors.
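The BROWN example, and the weakness just described, can be reproduced with a short sketch. It follows the notes in placing the parity bit at the end of each byte; the function names are assumptions made for illustration.

```python
def add_even_parity(byte_bits: str) -> str:
    """Append a parity bit so the total number of 1s is even."""
    parity = "1" if byte_bits.count("1") % 2 == 1 else "0"
    return byte_bits + parity

def passes_even_parity(bits: str) -> bool:
    """Receiver's check: the count of 1s must be even."""
    return bits.count("1") % 2 == 0

frames = [add_even_parity(format(ord(ch), "08b")) for ch in "BROWN"]
print(frames)

print(passes_even_parity("110000100"))  # one bit of B flipped: error detected
print(passes_even_parity("010000010"))  # B received as A: error NOT detected
```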

Checksum

Checksums are a follow-on from the use of parity checks in that they are used to check that data has been transmitted accurately from one device to another. A checksum is used for whole files of data, as opposed to a parity check which is performed byte by byte. They are used when data is transmitted, whether it be from one computer to another in a network, or across the internet, in an attempt to ensure that the file which has been received is the same as the file which was sent.

A checksum can be calculated in many different ways, using different algorithms. For example, a simple checksum could simply be the number of bytes in a file. Just as we saw with the problem of the transposition of bits deceiving a parity check, this type of checksum would not be able to notice if two or more bytes were swapped; the data would be different, but the checksum would be the same. Sometimes, cryptographic algorithms are used to verify data; the checksum is calculated using an algorithm called a hash function (not to be confused with a hash total, which we will be looking at next) and is transmitted at the end of the file. The receiving device recalculates the checksum, and then compares it to the one it received, to make sure they are identical.

Two common checksum algorithms are MD5 and SHA-1, but both have been found to have weaknesses. Two different files can have the same calculated checksum, so because of this, newer SHA-2 and even newer SHA-3 have been developed which are much more reliable.

The actual checksum is produced in hexadecimal format. This is a number system based on 16, whereas we typically count in base 10. It uses the digits 0 to 9 followed by the letters a to f, which represent the values 10 to 15.

MD5 checksums consist of 32 hexadecimal characters, such as 591a23eacc5d55a528e22ec7b99705cc. These are added to the end of the file. After the file is transmitted, the checksum is recalculated by the receiving device and compared with the original checksum. If the checksum is different, then the file has probably been corrupted during transmission and must be sent again.
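Python's standard `hashlib` module can calculate these checksums. The sketch below is an illustration only: the helper names and the sample data are invented, MD5 is shown because the notes use it, and SHA-256 is included as a member of the newer SHA-2 family.

```python
import hashlib

def md5_checksum(data: bytes) -> str:
    """MD5 checksum of the data as 32 hexadecimal characters."""
    return hashlib.md5(data).hexdigest()

def sha256_checksum(data: bytes) -> str:
    """SHA-256 (SHA-2 family) checksum as 64 hexadecimal characters."""
    return hashlib.sha256(data).hexdigest()

sent = b"exam results file"
received = b"exam results fi1e"   # one character corrupted in transit

print(md5_checksum(sent))
print(md5_checksum(sent) == md5_checksum(received))   # mismatch: send again
print(len(sha256_checksum(sent)))
```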

Hash Total

This is similar to the previous two methods in that a calculation is performed using the data before it is sent, then it is recalculated, and if the data has been transmitted successfully with no errors, the result of the calculation will be the same. However, this time the calculation takes a different form; a hash total is usually found by adding up all the numbers in a specific field or fields in a file. It is usually performed on data not normally used in calculations, such as an employee code number. After the data is transmitted, the hash total is recalculated and compared with the original value. If it has not been transmitted properly or data has been lost or corrupted, the totals will be different. Data will have to be sent again or the data will have to be visually checked to detect the error.

This type of check is normally performed on large files but, for demonstration purposes, we will just consider a simple example. Sometimes, school examination secretaries are asked to do a statistical analysis of exam results. Here we have a small extract from the data that might have been collected.

Student ID | Number of exam passes
4762 | 6
0153 | 8
2539 | 7
4651 | 3

Normally, the Student ID would be stored as an alphanumeric type, so for a hash check, it would be converted to a number. The hash check involves adding all the Student IDs together. In this example, it would perform the calculation 4762 + 153 + 2539 + 4651 giving us a hash total of 12105. The data would be transmitted along with the hash total and then the hash total would be recalculated and compared with the original to make sure it was the same and that the data had been transmitted correctly. We would use a hash total here because there is no other point to adding the Student IDs together. Apart from verification purposes, the hash total produced is meaningless and is not used for any other purpose.
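As a rough illustration, the hash total calculation for the extract above could look like this in Python. The function name and the corrupted example are hypothetical; the Student IDs are the ones from the table.

```python
def hash_total(student_ids: list[str]) -> int:
    """Convert each alphanumeric ID to a number and add them all together."""
    return sum(int(student_id) for student_id in student_ids)


# IDs are stored as text, so "0153" keeps its leading zero until it is converted.
sent_ids = ["4762", "0153", "2539", "4651"]
sent_total = hash_total(sent_ids)
print(sent_total)  # 12105

# Hypothetical transmission error: 2539 arrives as 2593.
received_ids = ["4762", "0153", "2593", "4651"]
print(hash_total(received_ids) == sent_total)  # False, so the error is detected
```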

Control Total

A control total is calculated in the same way as a hash total but is only carried out on numeric fields. There is no need to convert alphanumeric data to numeric. The value produced is a meaningful one which has a use. In our example above, we can see that it would be useful for the head teacher to know what the average pass rate was each year. The control total can be used to calculate this average by dividing it by the number of students. The calculation is 6 + 8 + 7 + 3 giving us a control total of 24. If that is divided by 4, the number of students, we find that the average number of passes per student is 6. The control total check is usually carried out on much larger volumes of data than our small extract.

The use of a control total is the same as for a hash total in that the control total is added to the file, the file is transmitted and the control total is recalculated. Just as with the hash total, if the values are different, it is an indication that the data has not been transmitted or entered correctly. However, both types of checks do have their shortcomings. If two numbers were transposed, say student 4762 was entered as having 8 passes and 0153 with 6 passes, this would be an error but would not be picked up by either a control or hash total check.
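The control total and the shortcoming described above can be demonstrated with a short Python sketch. The function names are made up for the example; the data is the extract used in this section.

```python
def control_total(records: list[tuple[str, int]]) -> int:
    """Add up the numeric field (number of exam passes)."""
    return sum(passes for _, passes in records)


def hash_total(records: list[tuple[str, int]]) -> int:
    """Add up the Student IDs after converting them to numbers."""
    return sum(int(student_id) for student_id, _ in records)


records = [("4762", 6), ("0153", 8), ("2539", 7), ("4651", 3)]

total = control_total(records)
print(total)                 # 24
print(total / len(records))  # 6.0, the average number of passes per student

# The passes for students 4762 and 0153 are entered the wrong way round.
transposed = [("4762", 8), ("0153", 6), ("2539", 7), ("4651", 3)]

# Both totals are unchanged, so neither check notices the error.
print(control_total(transposed) == total)
print(hash_total(transposed) == hash_total(records))
```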

Batch Processing

Batch processing is a method used to process large amounts of data by collecting it over a period of time and grouping it into batches. Once a batch is collected, it is processed all at once, often with a delay between data collection and processing. Two key components of this system are the master file, which stores permanent data (e.g., employee records), and the transaction file, which contains temporary data (e.g., hours worked) used to update the master file.

There are three types of transactions: adding, deleting, and updating records in the master file. Batch processing is common in systems like payroll, billing, and stock control.

Advantages include efficient use of computer resources, especially during off-peak hours, and the simplicity of the required computer system. Each batch is processed without human intervention, which speeds up processing and reduces human error. However, a batch can take a long time to complete because of the large volume of data, and batch processing is unsuitable for tasks requiring immediate processing.

In payroll systems, for example, employee information (stored in the master file) is combined with the weekly hours worked (stored in the transaction file), and wages are calculated accordingly. Before processing, the transaction file is validated and sorted into the same order as the master file.
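A payroll batch run along these lines might be sketched as follows. This is a simplified illustration, not a real payroll system: the record layouts, the field names and the 0 to 168 hours validation rule are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Employee:
    """Master file record (permanent data)."""
    employee_id: str
    name: str
    hourly_rate: float


@dataclass
class Timesheet:
    """Transaction file record (temporary data)."""
    employee_id: str
    hours_worked: float


def is_valid(t: Timesheet) -> bool:
    # Assumed range check: a week has 168 hours.
    return 0 <= t.hours_worked <= 168


def run_payroll(master: list[Employee], transactions: list[Timesheet]):
    """Process one batch. Returns the wages and the rejected transactions."""
    master = sorted(master, key=lambda e: e.employee_id)
    # Validate the transaction file, then sort it into master file order.
    rejected = [t for t in transactions if not is_valid(t)]
    valid = sorted((t for t in transactions if is_valid(t)),
                   key=lambda t: t.employee_id)

    wages: dict[str, float] = {}
    i = 0
    for emp in master:
        # Transactions with no matching master record are rejected.
        while i < len(valid) and valid[i].employee_id < emp.employee_id:
            rejected.append(valid[i])
            i += 1
        hours = 0.0
        while i < len(valid) and valid[i].employee_id == emp.employee_id:
            hours += valid[i].hours_worked
            i += 1
        if hours > 0:
            wages[emp.employee_id] = hours * emp.hourly_rate
    rejected.extend(valid[i:])
    return wages, rejected


master_file = [Employee("E002", "Bea", 20.0), Employee("E001", "Ali", 15.0)]
transaction_file = [Timesheet("E002", 10), Timesheet("E001", 40),
                    Timesheet("E003", 5), Timesheet("E001", -3)]

wages, rejected = run_payroll(master_file, transaction_file)
print(wages)
print(rejected)
```

Because both files are in the same order, each file only needs to be read through once from start to finish, which is why the transaction file is sorted before processing.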

Data Processing and Information

Data and information

Data is the Plural of Datum, essentially raw numbers, letters, symbols, sounds, or images with no meaning.

Once data is given a meaning and context it then becomes information. People often confuse data and information in truth data is just information in its raw form with no context or meaning.

Here are some examples:

Sets of data and information :

110053, 641609, 160012, 390072, 382397, 141186

If we are told that 110053, 641609, 160012, 390072, 382397, and 141186 are all postal codes in India (a context), the first set of data becomes information as it now has meaning.

01432 01223 01955 01384 01253 01284 01905 01227 01832 01902 01981 01926 01597

Similarly, if you are informed that 01432, 01223, 01955, 01384, 01253, 01284, 01905, 01227, 01832, 01902, 01981, 01926, and 01597 are telephone area dialling codes in the UK, they can now be read in context and we can understand them, as they now have a meaning.

So Data processing is just turning raw data into information by giving it meaning and context.

Data on a computer is stored as binary digits (bits) in the form of ones and zeros. It can be stored on various media such as hard disk drives, solid-state drives, DVDs, SD cards, memory sticks, or in RAM. Data is processed to produce new, meaningful information, often for analysis. This processing involves different operations, such as opening source files (e.g., .csv files) in a spreadsheet and adding formulae. Data is input, stored, processed, and output as usable information.

Data collected by Direct data sources is collected for a specific purpose and used for that purpose and that purpose only. It is often referred to as ‘original source data’. Examples of sources of direct data are questionnaires, interviews, observation, and data logging.

Advantages

Disadvantages

Data is more reliable as the source is trusted

Collecting data is time-consuming

The person collecting the data can use methods to gather specific data even if the required data is obscure, whereas with indirect data sources, this type of data may never have been collected before

It is expensive as data loggers and computers may be needed

Unnecessary data is eliminated

The sample size is too small to be used for statistical purposes due to monetary and restrictions

Data is most likely up-to-date

Data is collected in the required format.

For example, an online store collects customer details about their purchases. The store can keep statistics on product sales made and re-stocking, this is known as direct data.

Sources of direct data, for example:

  1. questionnaires

  2. interviews

  3. data logging

  4. observation

Indirect data is data obtained from a third party and used for a different purpose than what it was collected for. Examples of indirect data sources are the electoral register and businesses collecting personal information for use by other organizations (third parties).

Advantages

Disadvantages

Data is immediately available

unnecessary

data may be present

A large sample is likely to be used which means can be used for statistical purposes.

Data may be out of date

The cost is cheaper as no equipment is bought only the data

The source may not be so trustworthy

The data will be purchased from a specialist company so it will be accurate.

Data will have to be edited to the required format.

Data may be biased as the sample size may be local while the use may be national.

This company can sell customer details, such as email addresses to another company with your consent. The data obtained by the third-party company is known as indirect data.

Sources of indirect data, for example:

  1. weather data

  2. census data

  3. electoral register

  4. businesses collecting personal information when used by third parties

  5. research from textbooks, journals and websites

The quality of information can be subjective, depending on the user’s perspective, but it can also be objectively assessed based on certain factors. Poor quality data can lead to serious consequences, such as distorted business decisions, poor customer service, and a damaged reputation. For example, a UK hospital was temporarily closed due to incorrect death rate data, and in the USA, incorrectly addressed mail costs the postal service significant time and money. Accurate data is crucial for businesses to understand their performance and identify future opportunities.

Some of the factors that affect the quality of information are:

Accuracy

To ensure information is accurate, it must be free from errors and mistakes, which often depend on the accuracy of the collected data. Mistakes can occur during data collection, such as transposing numbers in a stock check. Verification and validation methods can help check data accuracy. Unambiguous questions are crucial indirect data sources to avoid misleading responses. Multiple-choice questions can help quantify responses. Inaccuracies can also arise from non-representative samples, data entry errors, or improperly calibrated sensors. Proper setup of computer systems is essential for accurate data interpretation.

When judging the quality of information, relevance is crucial. Data collected should be pertinent to the purpose it is intended for, meeting the user’s needs. Collecting irrelevant data wastes time and money. Data may be too detailed, too general, or geographically irrelevant. Clear information needs and search strategies are essential. In academic studies, selecting academic sources over biased ones is important. For example, in a school setting, teachers should focus on relevant material to help students pass their exams, rather than on interesting but irrelevant topics. During the first step, it is important to select relevant sources and rule out biased resources. Additionally, you must read thoroughly and select relevant information to find what is exactly required.

Open-ended questions have unlimited responses while close-ended ones have limited responses.

Age

The age of information significantly impacts its quality. The information must be accurate, relevant, and up-to-date. Over time, information can become outdated, leading to inaccurate results. For instance, personal information in a database that hasn’t been updated to reflect changes like marriage or having children can lead to incorrect assessments, such as in loan applications or targeted advertising. Outdated information can result in poor decisions, wasting time and money, and ultimately affecting profits.

Level of Detail

For information to be useful, it needs the right amount of detail. Too much detail can make it difficult to extract the necessary information, while too little detail may not provide a comprehensive view of the problem. Information should be concise enough for easy examination and use, without extraneous details. For example, a car company director would benefit from a graph showing monthly sales figures rather than a detailed daily report for each model over the past year. Understanding the user’s needs is crucial to providing the correct level of detail.

Completeness of the Information

High-quality information must be complete, addressing all relevant parts of a problem. Incomplete information creates gaps, making it difficult to solve problems or make informed decisions. Collecting additional data to fill these gaps can be time-consuming. For example, a car company director needs a full year’s sales figures for all models, not just the first six months or only the best-selling models. Ensuring completeness is as important as accuracy when inputting data into a database.

1.3 Encryption

1.3.1 The Need for Encryption

When personal information, such as credit card details or personal data, is sent over the internet, there is a risk of interception. Intercepted information can be altered or used for identity theft, cyber fraud, or ransom. Company secrets can be sold to rivals. However, if intercepted information is unreadable, it becomes useless to hackers. Despite vigilant security measures, hackers can still breach systems, but encryption makes the data indecipherable, rendering hacking efforts futile. Encryption keeps personal data private and secure, preventing hackers from understanding communications and protecting online banking and shopping. S encryption is when data is scrambled into a code with resulting symbols appearing jumbled up so that it cannot be understood.

Encryption converts data into a code that only authorized individuals can understand. This process applies to data transmission and storage, converting plaintext to ciphertext. While it doesn’t stop cybercriminals from intercepting data, it prevents them from understanding it. Both personal and business data are vulnerable to hacking, and encryption helps protect confidential information and maintain client trust. Encryption should be applied to computers, hard drives, pen drives, and portable devices like laptops, tablets, and smartphones to prevent data misuse if the device is hacked, lost, or stolen.

Encryption works by using an encryption key to encode the data on the sending computer. The receiving computer uses a corresponding decryption key to translate it back. A key is a collection of bits, often randomly generated, and the longer the key, the more effective the encryption. For example, 128-bit keys offer (2^{128}) combinations, making it virtually impossible to crack. Modern encryption often uses 256-bit keys, which are even more secure. The key, combined with an algorithm, creates the ciphertext.

1.3.2 Methods of Encryption

Encryption involves converting data into a code by scrambling it, resulting in jumbled symbols. The algorithms used for this process are highly complex, making it extremely difficult for even the most dedicated hackers to decipher the data. Encrypted data is known as ciphertext, while unencrypted data is called plaintext.

There are two main types of encryption:

  1. Symmetric Encryption: Uses the same key for both encryption and decryption.

  2. Asymmetric Encryption (Public-Key Encryption): Uses a pair of keys—one for encryption (public key) and one for decryption (private key).

This ensures that data remains secure and unreadable to unauthorized individuals, protecting personal and business information from cyber threats.

Symmetric Encryption

Symmetric encryption, also known as ‘secret key encryption,’ involves both the sender and the receiver using the same key to encrypt and decrypt a message. This method is faster than asymmetric encryption but poses a security risk because the encryption key must be shared with the recipient. If the key is intercepted, the message can be decrypted by anyone. To mitigate this risk, many companies use asymmetric encryption to send the secret key and then use symmetric encryption to encrypt the data. In symmetric encryption, both parties share the same private key, which scrambles and unscrambles the data.

Asymmetric Encryption

Asymmetric encryption, or ‘public-key encryption,’ uses two different keys: one public and one private. The public key, which is widely distributed, is used to encrypt data, while the private key, known only to the recipient, is used to decrypt it. This method allows secure transmission over public channels like the internet because the public key cannot be used to decrypt the message it encrypted. It is nearly impossible to derive the private key from the public key and the encrypted message. Asymmetric encryption is commonly used for sending secure emails and digitally signing documents. Asymmetric encryption uses more processing power

1.3.3 Encryption Protocols

An encryption protocol is a set of rules that dictate how algorithms should be used to secure information. Several encryption protocols exist:

  • IPsec (Internet Protocol Security): This protocol suite allows the authentication of computers and encryption of data packets to provide secure communication between two computers over a public network. It is commonly used in VPNs (Virtual Private Networks).

  • SSH (Secure Shell): This protocol enables secure remote login to a computer network. SSH is often used for logging in and performing operations on remote computers, as well as for transferring data between computers.

  • TLS (Transport Layer Security): The most popular protocol for securely accessing web pages. TLS is an improved version of the SSL (Secure Sockets Layer) protocol, and the term SSL/TLS is often used to refer to both.

The Purpose of SSL/TLS

Since TLS is a development of SSL, the terms are sometimes used interchangeably. The main purposes of SSL/TLS are to:

  • Enable encryption to protect data.

  • Ensure the authenticity of the entities exchanging data using a digital certificate.

  • Maintain data integrity to prevent corruption or alteration.

Additional purposes include:

  • Ensuring websites meet PCI DSS (Payment Card Industry Data Security Standard) rules for secure bank card payment processing.

  • Improving customer trust by demonstrating that a company uses SSL/TLS to protect its website.

Many websites use SSL/TLS to encrypt data during transfer, protecting it from attackers. SSL/TLS should be used when storing or sending sensitive data online, such as during tax returns, online shopping, or insurance renewals. Websites with an HTTPS address use SSL/TLS, which verifies the server’s identity using digital certificates. These certificates contain information like the domain name, the issuing certificate authority (CA), and the public key. Although SSL was replaced by TLS, these certificates are still referred to as SSL certificates. Valid SSL certificates can only be obtained from a CA, which conducts checks on applicants to ensure they receive a unique certificate.

The Use of SSL/TLS in Client–Server Communication

Transport Layer Security (TLS) is essential for applications requiring secure data exchange over a client-server network, such as web browsing sessions and file transfers. Similar to IPsec, TLS can also enable VPN connections and Voice over IP (VoIP). To establish an SSL/TLS connection, a client (e.g., a web browser) needs to obtain the server’s public key, found in the server’s digital certificate. This certificate proves the server’s authenticity.

When a browser wants to access a secured website, an SSL/TLS handshake occurs. This involves the client and server exchanging messages to agree on communication rules. The client sends its SSL/TLS version and a list of supported cypher suites (encryption types). The server responds with its chosen cypher suite and its SSL certificate. The client verifies the certificate’s validity and the server’s legitimacy. The client then sends an encrypted random string of bits, used to calculate the private key. The client completes its part of the handshake by sending an encrypted message to the server.

1.3.4 Uses of Encryption

Hard-Disk Encryption

Hard-disk encryption automatically encrypts files when they are written to the disk and decrypts them when read, leaving all other data on the disk encrypted. This process is understood by common application software like spreadsheets, databases, and word processors. Full disk encryption protects data even if the disk is stolen or left unattended, as only the keyholder can access its contents. However, if an encrypted disk crashes or the OS becomes corrupted, data recovery can be problematic. It is crucial to store encryption keys safely, as no one can access the data without the key. Booting up the computer can also be slower with full disk encryption.

Email Encryption

Encrypting emails ensures that their content can only be read by the intended recipient. While many people rely on passwords to protect their email accounts, emails are still susceptible to interception. Unencrypted emails can expose sensitive information to hackers. In the early days of email communication, most messages were sent in plain text, making them easily accessible to unauthorized individuals. Encrypting emails adds an essential layer of security to protect personal and sensitive information.

There are three parts to email encryption.

1 The first is to encrypt the actual connection from the email provider because this prevents hackers from intercepting and acquiring login details and reading any messages sent (or received) as they leave (or arrive at) the email provider’s server.

2 Then, messages should be encrypted before sending them so that even if a hacker intercepts the message, they will not be able to understand it. They could still delete it on interception, but this is unlikely.

3 Finally, since hackers could bypass your computer’s security settings, it is important to encrypt all your saved or archived messages.

Email encryption uses asymmetric encryption. The email sender uses the public key to encrypt the message and the receiver uses the private key to decrypt the message.

Email encryption can not encrypt your email only your message so you can’t send emails anonymously.

HTTPS

HTTPS (Hypertext Transfer Protocol Secure) extends HTTP and uses SSL/TLS protocols to encrypt data transferred between a web browser and a web server. This ensures that any data exchanged, such as login credentials, personal information, and payment details, is secure from eavesdroppers and hackers. Websites using HTTPS display a padlock icon in the browser’s address bar, indicating a secure connection. HTTPS is essential for protecting sensitive transactions, such as online banking, shopping, and any activity requiring the exchange of personal data. It also helps in verifying the authenticity of a website, ensuring users are communicating with the intended server and not a malicious site. HTT

Sure, here’s a summary of the advantages and disadvantages of different encryption protocols and methods:

Advantages of Encryption:

  1. Data Protection: Encrypting personal information, such as credit card details, prevents identity theft, cyber-fraud, and ransomware attacks.

  2. Security for Company Secrets: Protects sensitive company information from being sold to competitors.

  3. Integrity: Ensures that data cannot be altered during transmission.

Disadvantages of Encryption:

  1. Performance Impact: Encrypting data increases loading times and requires additional processing power.

  2. Resource Intensive: Uses more memory and computational power, especially with larger key sizes.

  3. Ransomware Risk: Hackers can encrypt data and demand a ransom for the decryption key.

  4. Key Management Issues: Losing the private key can result in permanent data loss. Reissuing digital certificates can be time-consuming.

  5. User Carelessness: Decrypted data left unprotected can be vulnerable to attacks.

Symmetric vs. Asymmetric Encryption:

  • Symmetric Encryption: Faster and suitable for large amounts of data but requires secure key exchange.

  • Asymmetric Encryption: More secure for key exchange but slower and computationally intensive.

SSL/TLS vs. IPsec for VPNs:

  • SSL/TLS:

    • Advantages: Easier management of digital certificates, no need for client software, and simpler setup.

    • Disadvantages: Weaker security due to optional client authentication, and extra software may be needed for non-web applications.

  • IPsec:

    • Advantages: Stronger security with mandatory client and server authentication, supported by more operating systems.

    • Disadvantages: More complex management, higher costs for client software, and time-consuming certificate management.

Ensuring the accuracy of Data

Ensuring data accuracy is crucial for producing reliable results during data processing. Data entry, often the most time-consuming part of this process, must be performed with minimal errors to avoid the need for extensive corrections or re-entry. To achieve this, two methods are employed: validation and verification. Validation ensures that the data values entered are reasonable and sensible, though it does not guarantee correctness. For instance, a validation check might prevent utility bills from exceeding a certain amount, but it wouldn’t catch a bill of $321 instead of $231 if both amounts are considered reasonable. Verification, on the other hand, focuses on the accuracy of the data entry process itself, ensuring that the data entered matches the source.

Various validation methods can be used depending on the type of data being input. For example, a reasonableness check can ensure that data values fall within a sensible range, such as preventing excessively high utility bills. However, not all fields can be easily validated; names, for instance, can vary widely and include special characters that complicate validation. In a school library database, for example, validation routines would be applied to ensure the accuracy of data in tables for books and borrowers. While validation helps ensure data is reasonable, verification ensures the accuracy of the data entry process, both of which are essential for maintaining data integrity.

Presence Check

A presence check ensures that important data is not omitted from certain fields, especially key fields like ISBN in a Books table or Borrower_ID in a Borrowers table. This check is often indicated by a red asterisk on online forms, prompting users to complete mandatory fields before proceeding. However, presence checks do not prevent incorrect or unreasonable data from being entered.

Range Check

Range checks are applied to fields containing numeric data to ensure values fall within a specified range. For example, in a Book table, a range check might ensure that the cost of a book is between $10 and $29. This type of check helps prevent errors like entering an impossible date, such as 31/2/05, by ensuring each part of the date falls within valid limits.

Type Check

Type checks ensure that data entered into a field is of the correct data type. For instance, the Borrower_ID field in a Borrowers table might be set to accept only numeric characters. However, simply setting the field data type to numeric is not always sufficient, as it might remove leading zeros, which are important in fields like telephone numbers. Therefore, a validation routine might be needed to allow only specific characters, such as digits 0 to 9, ensuring no invalid characters are entered. This type of check is also known as an invalid character check or character check. Despite its usefulness, a type check does not alert users if the correct number of characters has not been entered in a particular field.

Length Check

A length check is performed on alphanumeric fields to ensure they have the correct number of characters, but it is not used with numeric fields. Generally, it is applied to fields with a fixed number of characters, such as telephone numbers in France, which are typically 10 digits long. For instance, in a database, a length check could be applied to the ISBN field in the Books table to ensure all ISBNs are 13 characters long. Similarly, it could be used for the Borrower_ID field in the Borrowers table if these IDs are consistently four characters in length. Length checks can also be set to a range of lengths, such as phone numbers in Ireland, which vary between 8 and 10 characters. This check ensures that the entered data meets the required length, producing an error message if it does not.

Format Check

A format check ensures that data follows a specific pattern. For example, new vehicle registration plates in the UK follow a pattern of two letters, two digits, a space, and three letters (e.g., XX21 Y). In a database, a format check could be applied to the Class field in the Borrowers table to ensure it follows a pattern of two digits followed by one letter. This check is also useful for validating dates in a specific format, such as Date_of_birth, which might be two digits followed by a slash, two digits, another slash, and two digits. However, format checks do not prevent mistyped entries that still fit the pattern.

Limit Check

A limit check is similar to a range check but is applied to only one boundary. For example, in the UK, the legal driving age is 17, with no upper limit. A limit check would ensure that any entered age for a driving license application is not below 17. This type of check is useful for ensuring that data meets a minimum or maximum requirement, generating an error message if it does not.

Check Digit

A check digit is used to validate numerical data, often stored as a string of alphanumeric data. For example, the last digit of an ISBN for a book is a check digit calculated using a specific arithmetic method. Each digit in the first 12 digits of the ISBN is multiplied by 1 if it is in an odd-numbered position or by 3 if it is in an even-numbered position. The resulting numbers are added together and divided by 10. If the remainder is 0, that becomes the check digit; otherwise, the remainder is subtracted from 10 to get the check digit. This digit is then added as the 13th digit of the ISBN. When this data is entered into a database, the computer recalculates the check digit to ensure it matches. If it does not, an error message is produced, indicating a possible data entry error, such as transposing two digits.

Lookup Check

A lookup check compares entered data against a limited set of valid entries. If the data matches one of these entries, it is accepted; otherwise, an error message is produced. This check is efficient when there are a limited number of valid values, such as the days of the week. In a database, a lookup check could be used for the Class field in the Borrowers table, which might only have specific classes like 10A, 10B, 10C, etc. If an invalid class, such as 9B, is entered, the computer will not find a match and will produce an error message.

Consistency Check

A consistency check, also known as an integrity check, ensures that data across two fields is consistent. For example, in a Borrowers table, a consistency check could ensure that the Class field and the Date_of_birth field are consistent. If students in year 11 were born between 1 September 2004 and 31 August 2005, a consistency check would ensure that if the Class field is 11, the Date_of_birth must fall within this range. If not, an error message will be generated. This check is often used to ensure that a person’s age is consistent with their date of birth, although storing age is generally considered bad practice as it changes regularly and requires frequent updates.

Verification

Verification ensures that data has been entered accurately by a human or transferred correctly from one storage medium to another. There are several methods of verification, including visual checking and double data entry.

Visual Checking

Visual checking involves the person who enters the data visually comparing the entered data with the source document. This can be done by reading the data on the screen or by printing out the data and comparing it side by side with the source document. While this is the simplest verification form, it can be time-consuming and costly. Additionally, if the same person who entered the data is also checking it, they might overlook their mistakes. A more effective approach is to have someone else perform the check.

Double Data Entry

Double data entry involves entering the data twice. The first version is stored, and the second entry is compared to the first by a computer. If there are any differences, the computer alerts the person entering the data, who then checks and corrects the errors if necessary. Alternatively, two different people can enter the data, and the computer compares the two versions, alerting both operators to any discrepancies. This method ensures that data is copied accurately, although it does not verify the correctness of the original data. The key difference between visual verification and double data entry is that in the latter, the computer performs the comparison.

Parity Check

A parity check is a method used to ensure data has been transmitted accurately between devices. Computers store data as bits, which are grouped into bytes, typically of 8 bits each. ASCII (American Standard Code for Information Interchange) is commonly used to represent text, assigning numbers to characters. For instance, the ASCII code for uppercase ‘I’ is 73, represented in binary as 01001001. When transmitting data, a parity bit is added to each byte to ensure an even number of 1s, a method known as even parity.

During transmission, the sending device counts the number of 1s in each byte. If the count is even, the parity bit is set to 0; if odd, it is set to 1, ensuring the total number of 1s is even. The receiving device then checks each byte to confirm it has an even number of 1s. If an odd number of 1s is detected, it indicates an error in transmission. For example, the word “BROWN” in ASCII is represented by the bytes 01000010, 01010010, 01001111, 01010111, and 01001110. Adding parity bits results in 010000100, 010100101, 010011111, 010101111, and 010011100, respectively.
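
The “BROWN” example can be reproduced with a short Python sketch. The function names `add_even_parity` and `passes_even_parity` are illustrative, and the parity bit is appended at the end of each byte, as in the text.

```python
def add_even_parity(byte_bits: str) -> str:
    """Append a parity bit so that the total number of 1s is even."""
    parity_bit = "1" if byte_bits.count("1") % 2 else "0"
    return byte_bits + parity_bit


def passes_even_parity(bits_with_parity: str) -> bool:
    """Check carried out by the receiving device."""
    return bits_with_parity.count("1") % 2 == 0


# Convert each character to its 8-bit ASCII code, then add the parity bit.
encoded = [add_even_parity(format(ord(character), "08b")) for character in "BROWN"]
print(encoded)
# ['010000100', '010100101', '010011111', '010101111', '010011100']
print(all(passes_even_parity(bits) for bits in encoded))  # True
```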

While effective, parity checks are not foolproof. If two 1s are transmitted as 0s, the byte will still have an even number of 1s, and the error will go undetected. Similarly, if a 1 and a 0 are transposed, such as 010000100 (B with a parity bit added) being transmitted as 010000010 (A with a parity bit added), the parity check will not report an error as there is still an even number of 1s. More complex error-checking methods have had to be developed, but parity checking is still very common because it is such a simple method for detecting errors.
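
The weakness can be seen in a small sketch. It uses the B and A bit patterns from the paragraph above; the single-bit error is an extra made-up case added for contrast.

```python
def passes_even_parity(bits_with_parity: str) -> bool:
    """Return True if the received bits contain an even number of 1s."""
    return bits_with_parity.count("1") % 2 == 0


sent = "010000100"         # B with a parity bit added
single_error = "010000101" # one bit changed in transmission
transposed = "010000010"   # a 1 and a 0 swapped (A with a parity bit added)

print(passes_even_parity(single_error))  # False: the error is detected
print(passes_even_parity(transposed))    # True: the error goes undetected
```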


Checksum

Checksums are a follow-on from the use of parity checks in that they are used to check that data has been transmitted accurately from one device to another. A checksum is used for whole files of data, as opposed to a parity check which is performed byte by byte. They are used when data is transmitted, whether it be from one computer to another in a network, or across the internet, in an attempt to ensure that the file which has been received is the same as the file which was sent.

A checksum can be calculated in many different ways, using different algorithms. For example, a very simple checksum could be the number of bytes in a file. Just as the transposition of bits can deceive a parity check, this type of checksum would not notice if two or more bytes were swapped; the data would be different, but the checksum would be the same. Sometimes, cryptographic algorithms are used to verify data; the checksum is calculated using an algorithm called a hash function (not to be confused with a hash total, which we will be looking at next) and is transmitted at the end of the file. The receiving device recalculates the checksum and compares it with the one it received, to make sure they are identical.
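
The weakness of a byte-count checksum can be shown in a few lines. This is only a sketch: `byte_count_checksum` is a hypothetical name and the data is made up.

```python
def byte_count_checksum(data: bytes) -> int:
    """A very simple checksum: the number of bytes in the data."""
    return len(data)


original = b"BROWN"
corrupted = b"BRWON"  # two bytes swapped during transmission

print(original == corrupted)  # False: the data is different
print(byte_count_checksum(original) == byte_count_checksum(corrupted))  # True: checksum unchanged
```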

Two common checksum algorithms are MD5 and SHA-1, but both have been found to have weaknesses: two different files can produce the same calculated checksum. Because of this, the newer SHA-2 and the even newer SHA-3 have been developed, which are much more reliable.

The actual checksum is produced in hexadecimal format. This is a counting system that is based on the number 16, whereas we typically count numbers based on 10. Hexadecimal uses the digits 0 to 9 for the values zero to nine and the letters a to f for the values ten to fifteen.
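
As a quick illustration, Python's built-in `int` function can convert hexadecimal digits to their base 10 values. The variable name `hex_values` is just for this sketch.

```python
# Map every hexadecimal digit to its value in base 10.
hex_values = {digit: int(digit, 16) for digit in "0123456789abcdef"}

print(hex_values["9"])  # 9
print(hex_values["a"])  # 10
print(hex_values["f"])  # 15

# Two hexadecimal digits together can represent one byte (0 to 255).
print(int("ff", 16))  # 255
```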

MD5 checksums consist of 32 hexadecimal characters, such as 591a23eacc5d55a528e22ec7b99705cc. These are added to the end of the file. After the file is transmitted, the checksum is recalculated by the receiving device and compared with the original checksum. If the checksum is different, then the file has probably been corrupted during transmission and must be sent again.
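
The send-and-compare process can be sketched with Python's standard `hashlib` module. The function names `md5_checksum` and `verify` and the sample data are assumptions made for illustration; for simplicity the checksum is held in a separate variable rather than appended to the file.

```python
import hashlib


def md5_checksum(data: bytes) -> str:
    """Return the MD5 checksum of the data as hexadecimal characters."""
    return hashlib.md5(data).hexdigest()


def verify(received_data: bytes, received_checksum: str) -> bool:
    """Recalculate the checksum on the receiving device and compare."""
    return md5_checksum(received_data) == received_checksum


file_contents = b"example file contents"
checksum = md5_checksum(file_contents)

print(len(checksum))                               # 32
print(verify(file_contents, checksum))             # True: file received intact
print(verify(b"example file contentz", checksum))  # False: file must be sent again
```

Replacing `hashlib.md5` with `hashlib.sha256` would use a SHA-2 algorithm instead, which produces a longer checksum.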

Hash Total

This is similar to the previous two methods in that a calculation is performed using the data before it is sent, then it is recalculated, and if the data has been transmitted successfully with no errors, the result of the calculation will be the same. However, this time the calculation takes a different form; a hash total is usually found by adding up all the numbers in a specific field or fields in a file. It is usually performed on data not normally used in calculations, such as an employee code number. After the data is transmitted, the hash total is recalculated and compared with the original value. If it has not been transmitted properly or data has been lost or corrupted, the totals will be different. Data will have to be sent again or the data will have to be visually checked to detect the error.

This type of check is normally performed on large files but, for demonstration purposes, we will just consider a simple example. Sometimes, school examination secretaries are asked to do a statistical analysis of exam results. Here we have a small extract from the data that might have been collected.

Student ID    Number of exam passes
4762          6
0153          8
2539          7
4651          3

Normally, the Student ID would be stored as an alphanumeric type, so for a hash check, it would be converted to a number. The hash check involves adding all the Student IDs together. In this example, it would perform the calculation 4762 + 153 + 2539 + 4651 giving us a hash total of 12105. The data would be transmitted along with the hash total and then the hash total would be recalculated and compared with the original to make sure it was the same and that the data had been transmitted correctly. We would use a hash total here because there is no other point to adding the Student IDs together. Apart from verification purposes, the hash total produced is meaningless and is not used for any other purpose.
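
The calculation above can be written as a short Python sketch. The function name `hash_total` is illustrative; the IDs are stored as text and converted to numbers before being added.

```python
def hash_total(student_ids: list[str]) -> int:
    """Convert each alphanumeric ID to a number and add them all together."""
    return sum(int(student_id) for student_id in student_ids)


student_ids = ["4762", "0153", "2539", "4651"]

sent_total = hash_total(student_ids)      # calculated before transmission
received_total = hash_total(student_ids)  # recalculated after transmission

print(sent_total)                    # 12105
print(sent_total == received_total)  # True: the data was transmitted correctly
```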

Control Total

A control total is calculated in the same way as a hash total but is only carried out on numeric fields. There is no need to convert alphanumeric data to numeric. The value produced is a meaningful one which has a use. In our example above, it would be useful for the head teacher to know the average number of exam passes per student each year. The control total can be used to calculate this average by dividing it by the number of students. The calculation is 6 + 8 + 7 + 3, giving us a control total of 24. If that is divided by 4, the number of students, we find that the average number of passes per student is 6. The control total check is usually carried out on much larger volumes of data than our small extract.
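
The same arithmetic as a minimal Python sketch, using the figures from the extract (the variable names are illustrative):

```python
exam_passes = [6, 8, 7, 3]

control_total = sum(exam_passes)
average_passes = control_total / len(exam_passes)

print(control_total)   # 24
print(average_passes)  # 6.0
```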

The use of a control total is the same as for a hash total in that the control total is added to the file, the file is transmitted and the control total is recalculated. Just as with the hash total, if the values are different, it is an indication that the data has not been transmitted or entered correctly. However, both types of checks do have their shortcomings. If two numbers were transposed, say student 4762 was entered as having 8 passes and 0153 with 6 passes, this would be an error but would not be picked up by either a control or hash total check.
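
This shortcoming can be demonstrated with the extract. The sketch below uses a hypothetical helper, `totals`, that returns both the hash total and the control total for a list of records.

```python
def totals(records: list[tuple[str, int]]) -> tuple[int, int]:
    """Return (hash total of the Student IDs, control total of the passes)."""
    hash_total = sum(int(student_id) for student_id, _ in records)
    control_total = sum(passes for _, passes in records)
    return hash_total, control_total


correct = [("4762", 6), ("0153", 8), ("2539", 7), ("4651", 3)]
# The passes for students 4762 and 0153 have been transposed.
mis_entered = [("4762", 8), ("0153", 6), ("2539", 7), ("4651", 3)]

print(correct == mis_entered)                  # False: the data is wrong
print(totals(correct) == totals(mis_entered))  # True: neither check notices
```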

Batch Processing

Batch processing is a method used to process large amounts of data by grouping it into batches that are collected over a period of time. Once a batch is collected, it is processed all at once, often with a delay between data collection and processing. The key components of this system are the master file, which stores permanent data (e.g., employee records), and the transaction file, which contains temporary data (e.g., hours worked) used to update the master file.

There are three types of transactions: adding, deleting, and updating records in the master file. Batch processing is common in systems like payroll, billing, and stock control.
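
The three transaction types could be applied to a master file roughly as in the sketch below. The master file is modelled as a Python dictionary, and the employee IDs, field names, and the `apply_transactions` function are all invented for illustration.

```python
def apply_transactions(master: dict, transactions: list) -> dict:
    """Apply a batch of add, delete and update transactions to a master file."""
    updated = dict(master)  # work on a copy so the original is left unchanged
    for action, key, data in transactions:
        if action == "add":
            updated[key] = data
        elif action == "delete":
            updated.pop(key, None)
        elif action == "update":
            updated[key] = {**updated[key], **data}
        else:
            raise ValueError(f"unknown transaction type: {action}")
    return updated


master_file = {
    "E001": {"name": "Employee A", "hourly_rate": 12.0},
    "E002": {"name": "Employee B", "hourly_rate": 15.0},
}
transaction_file = [
    ("add", "E003", {"name": "Employee C", "hourly_rate": 11.0}),
    ("delete", "E001", None),
    ("update", "E002", {"hourly_rate": 16.0}),
]

new_master = apply_transactions(master_file, transaction_file)
print(sorted(new_master))                 # ['E002', 'E003']
print(new_master["E002"]["hourly_rate"])  # 16.0
```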

Advantages include efficient use of computer resources, since batches can be run during off-peak hours, and the simplicity of the required computer system. Once a batch starts, it is processed quickly without human intervention, which reduces human error. However, a large batch can take a long time to complete, and the delay between collecting and processing the data makes batch processing unsuitable for tasks requiring immediate processing.

In payroll systems, for example, employee information (stored in the master file) is combined with the weekly hours worked (stored in the transaction file), and wages are calculated accordingly. Before processing, the transaction file is validated and sorted into the same order as the master file.
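
A weekly payroll run might be sketched as follows. Everything here is hypothetical (the employee IDs, the hourly rates, and the `run_payroll_batch` function), and validation is reduced to skipping transactions for employees who are not in the master file.

```python
def run_payroll_batch(master: dict, transactions: list) -> dict:
    """Calculate wages for one batch of weekly hours."""
    wages = {}
    # Sort the transaction file into the same order as the master file (by ID).
    for employee_id, hours_worked in sorted(transactions):
        if employee_id not in master:
            continue  # validation: ignore transactions with no matching record
        wages[employee_id] = master[employee_id]["hourly_rate"] * hours_worked
    return wages


master_file = {
    "E001": {"name": "Employee A", "hourly_rate": 12.0},
    "E002": {"name": "Employee B", "hourly_rate": 15.0},
}
transaction_file = [("E002", 40), ("E001", 35), ("E999", 10)]

print(run_payroll_batch(master_file, transaction_file))
# {'E001': 420.0, 'E002': 600.0}
```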