1 Data processing and information

1.1 Data and information
Before we consider data processing, we need to define the term data. To be
completely accurate, the word ‘data’ is the plural of ‘datum’, a single piece of
data. Often, however, we use data in both the singular and the plural senses.
It seems awkward to say ‘the data are incorrect’ so we tend to say ‘the data is
incorrect’. When we use the word ‘data’, it can mean many different things. A lot
of people frequently confuse the terms ‘data’ and ‘information’. For the purposes
of this course we will consider data to be what is usually known as ‘raw’ data.
Data can take several forms; it can be characters, symbols, images, audio clips,
video clips and so on, none of which, on their own, have any meaning.
It is important for you to learn what the term
information means when we use
it in information technology. Information is data that has been given meaning,
which often results from the processing of data, sometimes by a computer. The
processed data can then be given a context and have meaning.
The difference between data and information is that data has no meaning,
whereas information is data which has been given meaning.
Here are some examples:
Sets of data:
110053, 641609, 160012, 390072, 382397, 141186
01432 01223 01955 01384 01253 01284 01905 01227 01832 01902 01981
01926 01597
σ ω ρ F m a
In this chapter you will learn:
+ what is meant by the terms ‘data’ and
‘information’ and about their use
+ what is meant by the terms ‘direct’ and
‘indirect’ data and about their uses and sources
+ what is meant by quality of information
+ what is meant by encryption, why it is
needed, and about the methods and uses of
encryption and protocols
+ about the methods and uses of validation and
verification
+ about the methods and uses of different
methods of processing (batch, online, real-time)
+ how to write a simple algorithm.
Before starting this chapter you should:
+ be familiar with the terms ‘observation’, ‘interviews’, ‘questionnaires’,
‘central processing unit (CPU)’, ‘chip and PIN’, ‘direct access’,
‘encryption’, ‘file’, ‘key field’, ‘RFID’, ‘sort’, ‘validation’ and ‘verification’.
These are sets of data which do not have a meaning until they are put into
context.
If we are told that 110053, 641609, 160012, 390072, 382397, and 141186 are
all postal codes in India (a context), the first set of data becomes information as
it now has meaning.
Similarly, if you are informed that 01432, 01223, 01955, 01384, 01253,
01284, 01905, 01227, 01832, 01902, 01981, 01926, and 01597 are telephone
area dialling codes in the UK, they can now be read in context and we can
understand them, as they now have a meaning.
The final set of data seems to be letters of the Greek alphabet apart from F, m
and a, which are letters in the Latin alphabet. However, if we are told that the
context is mathematical, scientific or engineering
formulae, we can see that they
all represent a different variable: σ represents standard deviation, ω represents
angular velocity, ρ is density, F is force, m is mass and a is acceleration. They
each now have meaning.
1.1.1 Data processing
On a computer, data is stored as a sequence of binary digits (bits) in the form of
ones and zeros. We will discuss bits later in this chapter, when we look at parity
checks. We can store data on fixed or removable media such as hard-disk
drives, solid-state drives, DVDs, SD cards and memory sticks, or in RAM. Data is
usually processed for a particular purpose, often so that it can be analysed. The
computer processing involved uses different operations to produce new data
from the source data. You will, perhaps, have met this in previous practical work
you have done, where you may have been given source files, including
.csv files.
You may have been asked to open these in a spreadsheet and add formulae. This
is the processing of that data so that it then has meaning.
To sum up, data is input, stored, and processed by a computer, for output
as usable information. Later in this chapter we will look at different types of
processing.
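To make this concrete, here is a minimal sketch in Python of turning raw data into information; the file name sales.csv and its ‘amount’ column are purely illustrative and are not files supplied with this course.

import csv

# Each row of the file is raw data; adding the values together and labelling
# the result gives the data a context and a meaning.
total = 0
with open("sales.csv", newline="") as source:
    for row in csv.DictReader(source):
        total += float(row["amount"])

print("Total sales:", total)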
Activity 1a
Explain the difference between data and information.
1.1.2 Direct and indirect data
Direct data is data that is collected for a specific purpose or task and is used
for that purpose and that purpose only. It is often referred to as ‘original
source data’. Examples of sources of direct data are questionnaires, interviews,
observation, and
data logging.
Indirect data is data that is obtained from a third party and used for a different
purpose to that which it was originally collected for and which is not necessarily
related to the current task. Examples of sources of indirect data are the electoral
register and businesses collecting personal information for use by other
organisations (third parties).
Direct data sources
Direct data sources are sources that provide the data gatherer with original
data. We will consider four such sources.
Questionnaires
A questionnaire consists of a set of questions, usually on a specific subject
or issue. The questions are designed to gather data from those people being
questioned. A questionnaire can be a mixture of what are called closed
questions (where you have to choose one or more answers from those
provided) and open-ended questions (questions where you can write in your
answers in more detail). Questionnaires are easy to distribute, complete and
collect as most people are familiar with the process. They can be completed on
paper or on computer.
Interviews
An interview is a formal meeting, usually between two people, where one of
them, the interviewer, asks questions, and the other person, the interviewee,
answers those questions. Interviews are used to collect data about a topic
and can be structured or unstructured. Structured interviews are similar to
a questionnaire, whereby the same questions are asked in the same order for
each interviewee and with a choice of answers. Unstructured interviews can be
different for each interviewee, particularly as they give them the opportunity
to expand on their answers. There is usually no pre-set list of answers in an
unstructured interview.
Observation
Observation is a method of data collection in which the data collectors
watch what happens in a given situation. The observer collects data by seeing
for themselves what happens, rather than depending on the answers from
interviewees or the accuracy of completed questionnaires.
Data logging
Data logging means using a computer and sensors to collect data. The
data is then analysed, saved and the results are output, often in the form of
graphs and charts. Data logging systems can gather and display data as an
event happens. The data is usually collected over a period of time, either
continuously or at regular intervals, in order to observe particular trends. It
involves recording data from one or more sensors and the analysis usually
requires special software. Data logging is commonly used in scientific
experiments, in
monitoring systems where there is the need to collect
information faster than a human possibly could, in hazardous circumstances
such as volcanoes and nuclear reactors, and in cases where accuracy is
essential. Examples of the types of information a data logging system can
collect include temperatures, sound frequencies, light intensities, electrical
currents, and pressure.
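As an illustration only, the short Python sketch below logs a reading at regular intervals; read_temperature() is a hypothetical stand-in for whatever sensor interface a real data logger would use.

import time

def read_temperature():
    # Placeholder: a real data logger would query the sensor hardware here.
    return 21.5

readings = []
for _ in range(60):                      # take 60 readings...
    readings.append((time.time(), read_temperature()))
    time.sleep(60)                       # ...one every minute

# The collected readings could then be analysed or plotted as a graph.
print(len(readings), "readings logged")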
Uses of direct data
An example of a use of direct data could be planning the alteration of a bus
route. A committee of residents on a new housing development, just outside
a local village, wants a bus company to re-route the bus service from the local
village to the town centre so that residents on the new development are able
to get to the town centre more easily. It will, however, involve the bus route
running through open countryside near the village. In order to persuade the
bus company to change the bus route, the committee will need to collect some
original data to present to them.
This data will include:
» How long it takes to walk from the new development to the existing bus
routes.
» The number of passengers who use the existing route.
» The number of passengers who would use the new route.
» The effect the villagers think the changed route would have on their daily
lives.
Here are some examples of how data could be collected.
» How long it takes to walk from the new development to the existing bus
routes.
Original data could be collected by actually walking from various points in
the new development and timing how long it would take. This might not be
practical if several points on the new development have to be considered, given
the time it would take to measure all the possible walking times.
» The number of passengers who use the existing route.
The suggested method to be used is a data-logger. Sensors fitted around the
door of each bus could be used to count the numbers of passengers boarding
and getting off at each stop. From these counts, a running total of the number of
passengers on the bus at any point along its route can be calculated (a short
sketch of this calculation is given after this list). The data would be fed
back to a data-logger or a tablet computer.
» The number of passengers who would use the new route.
In order to save time, questionnaires could be used. People living on the new
development would be asked to complete the questionnaires. The completed
questionnaires could then be transferred to a computer. Provided that the
questionnaires were completed honestly, an accurate assessment of how many
passengers would use the new route could be obtained.
» The effect the villagers think the changed route would have on their daily
lives.
In order to ensure completely honest responses, face-to-face interviews would be
best. The disadvantages of interviews are the length of time the process would
take and the potential difficulties of transferring the responses into a format that
a computer could deal with. However, because the interviewer can add follow-
up questions, the answers would be more accurate.
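As promised above, here is a minimal Python sketch of how the passenger numbers from the door sensors could be turned into a running total; the boarding and alighting figures are invented purely for illustration.

boarded  = [12, 5, 3, 0, 7]   # passengers getting on at each stop (example data)
alighted = [0, 2, 6, 4, 9]    # passengers getting off at each stop (example data)

on_board = 0
for stop, (on, off) in enumerate(zip(boarded, alighted), start=1):
    on_board += on - off      # running total of passengers currently on the bus
    print("After stop", stop, ":", on_board, "passengers on board")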
Indirect data sources
Indirect data sources are third-party sources that the data gatherer can obtain
data from. We will consider two such sources.
Electoral register
The
electoral register, also referred to as the electoral roll, is an example
of an indirect data source. It is a list of adults who are entitled to vote in an
election. Some countries have an ‘open’ version of the register which can be
purchased and used for any purpose. Electoral registers are used in countries
such as the USA, the UK, Australia, New Zealand, and many others. They
contain information such as name, address, age and other personal details,
although individuals are often allowed to remove some details from the open
version. In many countries,the full register can only be used for limited
purposes specified by the law in that country. The personal data in the
register must always be processed in a way that complies with the country’s
data protection laws.
Businesses collecting data from third parties
Businesses collect a great deal of personal information from
third parties,
such as their customers, when they sell them a product. Whenever they buy
something online, customers have to enter personal information or they have
already done this on a previous visit to that business’s website. It is often the
case that they agree to the business sharing this with other organisations.
Another development, in this regard, has been the emergence of data brokers.
These are companies that collect and analyse some of an individual’s most
sensitive personal information and sell it to each other and to advertisers and
other organisations without the individual’s knowledge. They usually gather it
by following people’s internet, social media, and mobile phone activity.
Uses of indirect data
Apart from elections and other government purposes, the electoral register
can only be used to select individuals for jury service or by credit reference
agencies. These agencies are allowed to buy the full register to help them check
the names and addresses of people applying for credit. They are also allowed to
use the register to carry out identity checks in an attempt to deal with money
laundering. It is a criminal offence for anyone to supply or use the register for
anything else. The open register, however, can have various uses; businesses can
use it to perform identity checks on their customers, charities often use it for
fundraising purposes, debt-collection agencies use it to track down people who
owe money. Whenever the address of an individual is required, a business could
use the open register to check it.
Businesses which collect personal information often use it to create mailing lists
that they then sell to other organisations, which are then able to send emails or
even brochures through the post.
These examples are not the only type of indirect data source. Any
organisation that provides data or information to the general public for
use by them can be said to be an indirect source. In the bus route example
described above, an indirect source could be used to provide some of the
required information. For instance, the timetable of the current bus service
could be used by the committee to work out the number of passengers using
the route by seeing how many times the bus runs during the day. However,
this would not be very accurate, as the buses may not carry a full load of
passengers each time and this is clearly not the purpose for which the data
was intended.
Another scenario could be studying pollution in rivers. Direct data sources
could be used, of course; questionnaires could be handed out to local
landowners and residents in houses near to the river, asking about the effects
on them of the pollution, and they could also be interviewed. Computers
with sensors could be used to collect data from the river. However, indirect
data sources could also be used; documents may have been published by
government departments showing pollution data for the area and there may be
environmental campaigners who have also published data related to pollution
in the area.
Advantages and disadvantages of direct and indirect data
Table 1.1 summarises the advantages and disadvantages of direct data when
compared to indirect data. Notice how each point contains a comparison.

Advantages of direct data
» We know how reliable direct data is, since we know where it originated. Where data is required from a whole group of people, we can ensure that a representative cross-section of that group is sampled. With indirect data sources we may not know where the data originated, and it could be that the source is only a small section of that group rather than a cross-section of the whole group; this is often referred to as sampling bias.
» The person collecting the data can use methods to gather specific data even if the required data is obscure, whereas with indirect data sources this type of data may never have been collected before.
» The data collector or gatherer only needs to collect as much or as little data as necessary, whereas with indirect data sources the original purpose for which the data was collected may be quite different to the purpose for which it is needed now, so irrelevant data may need to be removed.
» Once the data has been collected it may be useful to other organisations, and there may be opportunities to sell the data to them, reducing the expense of collection. With indirect data this opportunity will probably not arise, as organisations can go direct to the source themselves.

Disadvantages of direct data
» Because of time and cash constraints, the sample or group size may be small, whereas indirect data sources tend to provide larger sets of data that would use up less time and money than direct data collection with a larger sample size.
» The person collecting the data may not be able to gain physical access to particular groups of people (perhaps for geographical reasons), whereas the use of indirect data sources allows data from such groups to be gathered.
» Using a direct data source can be problematic if the people being interviewed are not available, reducing the sample size, whereas using indirect data sources allows the sample size to be greater, resulting in increased confidence in the results produced.
» It may not be possible to gather original data because of the time of year; for example, summer rainfall data may be needed but at the time of the data-gathering it is winter. With indirect data, historical weather data is available irrespective of the time of year.
» Gathering data from a specific sample takes a lot longer than it would with indirect data. In addition, by the time all the required data has been collected it may be out of date, so an indirect data source could have been used instead.
» Indirect data may be of a higher quality, as it might already have been collated and grouped into meaningful categories, whereas with direct data sources questionnaire answers can sometimes be difficult to read and transcripts of interviews take time to read in order to create the data source.
» The collection of direct data may be more expensive, as people may have to be paid to collect it and special equipment may have to be bought, such as data-loggers, computers with sensors or paper for questionnaires, none of which is needed when using an indirect source. There are still costs involved in using indirect data sources, such as the travelling expenses and time taken to go to the source, but these are generally lower.

Table 1.1 Advantages and disadvantages of direct and indirect data
Activity 1b
1 Explain why observation is considered to be a direct data source.
2 Give two differences between indirect data and direct data.
1.2 Quality of information
Measuring the quality of information is sometimes based on the value which
the user places on the information collected. As such it could be argued that
the judgement regarding the quality of information is fairly subjective, that is,
it depends on the user and such judgements can vary between users. However,
many experts do suggest that these judgements can be objective if based on
factors which are believed to affect the quality of information.
Poor quality data can lead to serious consequences. Poor data may give a
distorted view of business dealings, which can then lead to a business making
poor decisions. Customers can be put off dealing with businesses that give poor
service due to inaccurate data, causing the business to get a poor reputation.
With poor quality data it can be difficult for companies to have accurate
knowledge of their current performance and sales trends, which makes it hard
for them to identify worthwhile future opportunities.
One example can be seen in the data provided by a hospital in the UK, which
resulted in it being temporarily closed down, until it was realised that the death
rate data provided had been incorrect and it was actually significantly lower.
Meanwhile, in the USA, incorrectly addressed mail costs the postal service a
substantial amount of money and time to process correctly.
Some of the factors that affect the quality of information are described here.
1.2.1 Factors that affect the quality of information
Accuracy
As far as possible, information should be free from errors and mistakes. The
accuracy of information often depends on the accuracy of the collected
data before it is processed. If the original data is inaccurate then the resulting
information will also be inaccurate. In order to make sure the information is
accurate, a lot of time needs to be given to the collection and checking of the
data. Mistakes can easily occur. Consider a simple stock check. If it is carried
out manually, a quantity of 62 could easily be copied down as 26 if the digits
were accidentally transposed. This information is now inaccurate. More careful
checking of the data might have prevented this.
We will look at methods of checking the accuracy of data, such as
verification and
validation, later in this chapter. It is easy to see how errors might occur during the
data collection process. When using a direct data source, if we have not made the
questions clear then the people answering the questionnaires or being interviewed
may not understand them. We need to make sure that questions are clearly phrased
and are unambiguous, otherwise they might lead interviewees into providing the
answers that they think the interviewer is expecting. This can lead to the same
response being given by everyone, even though the question is open-ended. If
the questions are too open-ended, it could be difficult to quantify the responses.
It is often a good idea to include multiple-choice questions where the respondent
chooses an answer from those provided. These can be quantified quite easily. It is
important, however, to include a sufficient number of alternative answers.
Other reasons why the information derived from a study might be inaccurate
are that the sample chosen is not representative of the whole group or that the
data collector makes some errors when collecting or when entering the data into
a computer. If sensors are being used, these must be calibrated before use and
must be properly connected to the computer. In addition, the computer system
needs to be set up correctly so that the readings are interpreted correctly.
Relevance
When judging the quality of information, we need to consider the data that is
being collected. Relevance is an important factor because there has to be a good
reason why that particular set of data is being collected. Data captured should be
relevant to the purposes for which it is to be used. It must meet the requirements
of the user. The question that needs to be asked is: is the data really needed, or is it
being collected just for the sake of it? The relevance of data matters because the
collection of irrelevant data will entail a waste of time as well as money.
There are a number of ways in which the data may or may not be relevant to the
user’s needs. It could be too detailed or concentrate too much on one aspect.
On the other hand, it might be too general, covering more aspects of the task
than is necessary. It may relate to geographical areas that are not really part of
the study. Where the study is meant to be about pollution in a local area, for
example, data from other parts of the country would not be relevant. When
looking for relevant information, it is important to be clear about what the
information needs are for each specific search.
It is also necessary to be clear about the search strategy: what the user wants
and does not want to find and therefore what the user needs to look for. In
an academic study, it is important to select academic sources. Business sources
or sources which appear to have a vested interest should be ignored. Having
selected the sources, it is important to select the relevant information within
them. Consider a school situation. You need to study a tremendous amount of
information to prepare for your exams. How would you feel if your teachers
chose to spend several lessons talking about aspects of the subject that they
found really interesting? You may find that it was very interesting, but it
probably would not be very relevant to what you need to pass your course.
Age
How old information is can affect its quality. As well as being accurate and
relevant, information needs to be up-to-date. Most information tends to change
over time and inaccurate results can arise from information which has not been
updated regularly. This could apply, for example, to personal information in a
database being left unchanged. Someone could get married and have a baby.
If the original information was used, which had the person as single with no
dependants, this would produce inaccurate results if the person was applying for
a loan. This is because people who are married with children tend to be viewed
as being more responsible and more likely to keep up with repayments. This
inaccurate information would also affect a retailer’s targeted advertising if it
wanted to sell baby products to such customers, as the person would not appear
on its list of targets. The age of information is important, because information
that is not up-to-date can lead to people making the wrong decisions. In turn,
that costs organisations time, money, and therefore, profits.
Level of detail
For information to be useful, it needs to have the right amount of detail.
Sometimes, it is possible for the information to have too much detail, making it
difficult to extract the information you really want. Information should be in a
form that is short enough to allow for its examination and use. There should be
no extraneous information. For example, it is usual to summarise statistical data
and produce this information either in the form of a
table or using a chart. Most
people would consider a chart to be more concise than data in tables, as there is
little or no unnecessary information in a chart. A balance has to be struck between
the level of detail and conciseness. Suppose a car company director wants to see
a summary of the sales figures of all car models for the last year; the information
with the correct level of detail would be a graph showing the overall figures for
each month. If the director was given figures showing the sales of each model
every day of the previous 12 months in the form of a large report, this would be
seen as the wrong level of detail because it is not a summary. It is important to
understand what the user needs when they ask you for specific information.
On the other hand, the information might not have enough detail, meaning
that you do not get an overall view of the problem. This links closely to the
issue of the completeness of the information, which we will look at next.
Completeness of the information
In order for information to be of high quality it needs to be complete. To be
complete, information must deal with all the relevant parts of a problem. If it is
incomplete, there will be gaps in the information and it will be very difficult to
use to solve a particular problem or choose a certain course of action. Discovering
and collecting the extra information in order to remove these gaps may result in
improving the quality of the information, but can prove to be time-consuming.
Therefore, if the information is not complete, a decision has to be made: either
that it is complete enough to make a decision about a problem or that additional
data needs to be collected to complete the information. Consider the car company
director mentioned above who wants to see a summary of the sales figures for
the last year. If the director was given figures showing the sales for the first six
months, this would be incomplete. If the director was shown the figures for only
the best-selling models, this would be incomplete. It is important to understand
what the user needs when they ask you for specific information. To sum up,
completeness is as necessary as accuracy when inputting data into a database.
Activity 1c
1 List two factors that affect the quality of information.
2 Briefly describe what is meant by the quality of information.
1.3 Encryption
1.3.1 The need for encryption
Whenever you send personal information across the internet, whether it is credit
card information or personal details, there is a risk that it can be intercepted.
Once it is intercepted, the information can be changed, used for purposes such as
identity theft or cyber-fraud, or held to ransom. If it is information regarding a
company’s secrets, it could be sold by
hackers to rival companies. If, however,
the information is intercepted but it is unreadable or cannot be understood, it
becomes useless to the hacker or interceptor. Too many companies or individuals
become victims of hackers taking advantage of readily available usernames
and
passwords. No matter how vigilant we are regarding the security of our
computer systems, hackers will always find a way of getting into them, but if
they cannot decipher the information, it will mean the act of hacking is not
worthwhile. This is where encryption comes in. Encryption keeps much of our
personal data private and secure, often without us realising it. It prevents hackers
from reading and understanding our personal communications and protects
us when we bank and shop. Data is scrambled or jumbled up so that it is
completely unreadable. This prevents hackers understanding the data, as all they
see is a random selection of letters, numbers and symbols.
Encryption is a way of scrambling data so that only authorised people can
understand the information. It is the process of converting information into a
code which is impossible to understand. This process is used whether the data is
being transmitted across the internet or is just being stored. It does not prevent
cyber criminals intercepting sensitive information, but it does prevent them
from understanding it. Technically, it is the process of converting
plaintext to
ciphertext.
It is not just personal computers that are affected; businesses and commercial
organisations are also liable to be affected by hacking activities. Employing data
encryption is a safe way for companies to protect their confidential information
and their reputation with their clients, since the benefits of encryption do not
just apply to the use of the internet.
Information should also be encrypted on computers, hard-disk drives, pen
drives and portable devices, whether they be laptops, tablets, or smartphones.
The misuse of the data on these devices will be prevented, should the device be
hacked, lost orstolen.
1.3.2 Methods of encryption
Encryption is the name given to converting data into a code by scrambling it,
with the resulting symbols appearing to be all jumbled up. The algorithms (we
will be looking at the topic of algorithms in much more detail in Chapter 4)
which are used to convert the data are so complex that even the most dedicated
hacker with plenty of time to spare and hacking software to help them would be
extremely unlikely to discover the meaning of the data. Encrypted data is often
called ciphertext, whereas data before it is encrypted is called plaintext.
The way that encryption works is that the computer sending the message
uses an
encryption key to encode the data. The receiving computer has a
corresponding decryption key that can translate it back again. The process of
decryption here is basically reversing the encryption. A key is just a collection of
bits, often randomly generated by a computer. The greater the length of the key,
the more effective the encryption. Many systems use 128-bit keys, which gives
2¹²⁸ different combinations. It has been estimated that it would take a really
powerful computer 10¹⁸ (1 000 000 000 000 000 000, or one quintillion) years
to go through all the different combinations. Modern encryption uses 256-bit
keys, which would take very much longer to crack. As you can imagine, this
makes this form of encryption virtually impossible to crack. The key is used in
conjunction with an algorithm to create the ciphertext.
Figure 1.1 Encryption: the encryption algorithm combines the plaintext (for example ‘This is a test message which is to be encrypted.’) with the encryption key to produce unreadable ciphertext; the decryption algorithm uses the decryption key to turn the ciphertext back into the original plaintext.
There are two main types of encryption. One is called symmetric encryption
and the other is asymmetric encryption, which is also referred to as public-key
encryption.
Symmetric encryption
Symmetric encryption, often referred to as ‘secret key encryption’, involves the
sending computer, or user, and the receiving computer, or user, having the same
key to encrypt and decrypt a message. Although symmetric encryption is a much
faster process than asymmetric encryption, there is the problem of the originator
of the message making sure the person receiving the message has the same
private
key. The originator has to send the encryption key to the recipient before they
can decrypt the message. This, however, leads to security problems, since this key
couldbe intercepted by anybody and used to decrypt the message. Many companies
overcome this problem by using asymmetric encryption to send the secret key but
use symmetric encryption to encrypt data. So, with symmetric encryption both
sender and recipient have the same secret, private, encryption key which scrambles
the original data and unscrambles it back to an understandable format.
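A minimal sketch of symmetric encryption, using the third-party Python package cryptography (one possible library, shown purely as an illustration): the same key both encrypts and decrypts.

from cryptography.fernet import Fernet

key = Fernet.generate_key()        # the single shared secret key
cipher = Fernet(key)

ciphertext = cipher.encrypt(b"This is a test message which is to be encrypted.")
print(ciphertext)                  # scrambled, unreadable bytes
print(cipher.decrypt(ciphertext))  # readable again, but only with the same key

The difficulty described above is getting that one key to the recipient without anyone else intercepting it.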
Asymmetric encryption
Asymmetric encryption, sometimes referred to as ‘public-key encryption’, uses
two different keys, one public and one private. A
public key, which is distributed
among many users or computers, is used to encrypt the data. Essentially, this public
key is published for anyone to use to encrypt messages. A private key, which is only
available to the computers, or users, receiving the message, is used to decrypt the
data. When a message is encrypted using the public key, it can be sent across a public
channel such as the internet. This is not a problem as the public key cannot be used
to decrypt a message that it was used to encrypt. It is incredibly complicated, if not
impossible, to guess the private key using the public key and the encrypted message.
Basically, any user who needs to send sensitive data over the internet securely, can do
so by using the public key to encrypt the data, but the data can only be decrypted
by the receiving computer if it has its own private key. Asymmetric encryption is
often used to send emails and to digitally sign documents.
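The sketch below, again using the cryptography package and shown only as an illustration, generates a key pair and encrypts with the public key so that only the holder of the private key can decrypt.

from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()    # this half can be published freely

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

ciphertext = public_key.encrypt(b"Sensitive data", oaep)   # anyone can do this
plaintext = private_key.decrypt(ciphertext, oaep)          # only the key owner can
print(plaintext)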
1.3.3 Encryption protocols
An encryption protocol is the set of rules setting out how the algorithms should be
used to secure information. There are several encryption protocols. IPsec (internet
protocol security)
is one such protocol suite which allows the authentication of
computers and encryption of packets of data in order to provide secure encrypted
communication between two computers over an internet protocol (IP) network.
It is often used in VPNs (virtual private networks). SSH (secure shell) is another
encryption protocol used to enable remote logging on to a computer network,
securely. SSH is often used to log in and perform operations on remote computers,
but it can also be used for transferring data from one computer to another. The
most popular protocol used when accessing web pages securely is
transport layer
security (TLS). TLS is an improved version of the secure sockets layer (SSL)
protocol and has now, more or less, taken over from it, although the term SSL/TLS
is still sometimes used to bracket the two protocols together.
The purpose of secure sockets layer (SSL)/transport layer security (TLS)
Because TLS is a development of SSL, the terms TLS and SSL are sometimes
used interchangeably. We will use the term SSL/TLS in this book. The three
main purposes of SSL/TLS are to:
» enable encryption in order to protect data
» make sure that the people/companies exchanging data are who they say they
are (authentication)
» ensure the integrity of the data to make sure it has not been corrupted or altered.
Two other purposes are to:
» ensure that a website meets PCI DSS rules. The Payment Card Industry
Data Security Standard (PCI DSS) was set up so that company websites
could process bank card payments securely and to help reduce card fraud.
This is achieved by setting standards for the storage, transmission and
processing of bank card data that businesses deal with. Later versions of TLS
are required to meet new standards which have been imposed.
» improve customer trust. If customers know that a company is using the SSL/
TLS protocol to protect its website, they are more inclined to do business
with that company.
Many websites use SSL/TLS when encrypting data while it is being sent to
and from them. This keeps attackers from accessing that data while it is being
transferred. SSL/TLS should be used when storing or sending sensitive data
over the internet, such as when completing tax returns, buying goods online,
or renewing house and car insurance. Only going to websites which use SSL/
TLS is good practice. The SSL/TLS protocol enables the creation of a secure
connection between a web server and a browser. Data that is being transferred
to the web server is protected from eavesdroppers (the name given to people
who try to intercept internet communications).
The SSL/TLS protocol verifies the identity of the
server. Any website with an
HTTPS address uses SSL/TLS. In order to verify the identity of the server, the
protocol makes use of digital certificates, which contain such information as the
domain name that the certificate is issued for, which organisation, individual or
device it was issued to, the certificate authority (CA) that issued it, the CA’s digital
signature, and the public key, as well as other items. Although SSL was replaced by
TLS many years ago, these certificates are still referred to as SSL certificates today.
As well as keeping the user’s data secure, a website needs a digital certificate in order
to verify ownership of the website and also to prevent fraudsters creating a fake
version of the website. Valid SSL certificates can only be obtained from a CA. CAs
can be private companies or even governments. Before allowing someone to have
an SSL certificate, the CA will carry out a number of checks on an applicant and
following that, it is the responsibility of the CA to make sure that the company or
individual receives a unique certificate. Unfortunately, if hackers are able to break
through a CA’s security, they can start issuing bogus certificates to users and will
then be in a strong position to crack the user’s encryption.
The use of SSL/TLS in client–server communication
Transport layer security (TLS) is used for applications that require data to
be securely exchanged over a client–server network, such as web browsing
sessions and file transfers. Just like IPsec it can enable VPN connections and
Voice over IP(VoIP).
In order to open an SSL/TLS connection, a
client needs to obtain the public
key. For our purposes, we can consider the client to be a web user or a web
browser and the server to be the website. The public key is found in the server’s
digital certificate. From this we can see that the SSL/TLS certificate proves that
a client is communicating with the actual server that owns the domain, thereby
proving the authenticity of the server.
When a browser (client) wants to access a website (server) that is secured by
SSL/TLS, the client and the server must carry out an SSL/TLS handshake. A
handshake, in IT terms, happens when two devices want to start communicating.
One device sends a message to the other device telling it that it wants to set up
a communications channel. The two devices then send several messages to each
other so they can agree on the rules for communicating (a communications
protocol). Handshaking occurs before the transfer of data can take place.
With an SSL/TLS handshake, the client sends a message to the server telling it
what version of SSL/TLS it uses together with a list of the different ciphersuites
(types of encryption) that the client can use. The list of ciphersuites has the client’s
preferred type at the top and its least favourite at the bottom. The server responds
with a message which contains the ciphersuite it has chosen from the client’s list.
The server also shows the client its SSL certificate. The client then carries out a
number of checks to make sure that the certificate was issued by a trusted CA and
that it is in date and that the server is the legitimate owner of the public and private
keys. The client now sends a random string of bits that is used by both the client
and the server to calculate the secret session key. The string itself is encrypted using the
server’s public key. Authentication of the client is optional in the process. The client
sends the server another message, encrypted using the secret key, telling the server
that the client part of the handshake is complete. We will see in more detail in the
section on HTTPS how any further transmitted data is encrypted.
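To see a real handshake take place, the sketch below uses Python's standard ssl module to connect to a web server and report what was agreed; example.com is just a placeholder host.

import socket
import ssl

context = ssl.create_default_context()        # loads the trusted CA certificates

with socket.create_connection(("example.com", 443)) as sock:
    with context.wrap_socket(sock, server_hostname="example.com") as tls:
        print(tls.version())                  # protocol agreed, e.g. TLSv1.3
        print(tls.cipher())                   # ciphersuite chosen by the server
        print(tls.getpeercert()["issuer"])    # the CA that issued the certificate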
1.3.4 Uses of encryption
There are many reasons to encrypt data. Companies often store confidential
data about their employees, which could include medical records, payroll data,
as well as personal data. These need to be encrypted to prevent them becoming
public knowledge. An employee in a shared office may not want others to
have access to their work which may be stored on a hard disk, so it needs to be
encrypted. A company’s head office may wish to share sensitive business plans
with other offices using the internet. If the data is encrypted, they do not have
to worry about what would happen if it were intercepted. Company employees
and individuals may need to take their laptops or other portable devices with
them when they travel for work or pleasure. If the device contains sensitive
information which is not encrypted, it is possible that the information could be
retrieved by a third party if the device is left unattended.
As recently as 2018, it was reported that over the previous four years, staff
in five UK government departments had lost more than 600 laptops, mobile
phones and memory sticks. Fortunately, the data had been encrypted and was
therefore protected from prying eyes. Unfortunately, there have been occasions
where laptops have been left on trains and the data was unencrypted, causing
great embarrassment to the government when this was discovered.
There are other situations where encryption should be used. One example is
when individuals are emailing each other with information they would want
to remain confidential. They need to prevent anybody else from reading and
understanding their mail. People use websites for online shopping and online
banking. When doing so, the debit/credit card and other bank account details
should be encrypted to prevent fraudulent activity taking place.
Let us now consider three specific uses of encryption.
Hard-disk encryption
The principle of hard-disk encryption is fairly straightforward. When a file is
written to the disk, it is automatically encrypted by specialised software. When a
file is read from the disk, the software automatically decrypts it while leaving all
other data on the disk encrypted. The encryption and decryption processes are
understood by the most frequently used application software such as spreadsheets,
databases and word processors. The whole disk is encrypted, including data files,
the OS and any other software on the disk. Full (or whole) disk encryption is your
protection should the disk be stolen, or just left unattended. So, even if the disk is
still in the original computer, or removed and put into another computer, the disk
remains encrypted and only the keyholder can make use of its contents.
Another benefit of full disk encryption is that it automatically encrypts the data
as soon as it is saved to the hard disk. You do not have to do anything, unlike
the encryption of files and folders, where you have to individually encrypt them
as you go.
There are, however, drawbacks to encrypting the whole disk. If an encrypted disk
crashes or the OS becomes corrupted, you can lose all your data permanently or,
at the very least, disk data recovery becomes problematic. It is also important to
store encryption keys in a safe place, because as soon as a disk is fully encrypted,
no one can make use of any of the data or software without the key. Another
drawback can be that booting up the computer can be a slower process.
Email encryption
When sending emails, it is good practice to encrypt messages so that their
content cannot be read by anyone other than the person they are being sent
to. Many people might think that having a password to login to their email
account
is sufficient protection. Unfortunately, emails tend to be susceptible to
interception and, if they are not encrypted, sensitive information can become
readily available to hackers. In the early days of email communication, most
messages were sent as plaintext, which meant hackers could easily read their
content. Fortunately, most email providers now provide encryption by default.
There are three parts to email encryption.
1 The first is to encrypt the actual connection from the email provider, because
this prevents hackers intercepting and acquiring login details and reading any
messages sent (or received) as they leave (or arrive at) the email provider’s server.
2 Then, messages should be encrypted before sending them so that even if a
hacker intercepts the message, they will not be able to understand it. They
could still delete it on interception, but this is unlikely.
3 Finally, since hackers could bypass your computer’s security settings, it is
important to encrypt all your saved or archived messages.
Asymmetric encryption is the preferred method of email encryption. The email
sender uses the public key to encrypt the message while the recipient uses the
private key to decrypt it. It is considered good practice to encrypt all email
messages. If only the ones that are considered to be important are encrypted,
the hacker will know which emails contain sensitive data and will therefore
spend more time and energy trying to decode those particular ones. Encryption
only scrambles the message contents, not the sender’s email address, making it
very difficult to send messages anonymously.
Most types of email encryption require the sending of some form of digital
certificates to prove authenticity. The management of digital certificates, though
time-consuming, is crucial, as users would not want them to fall into the hands
of hackers.
Encryption in HTTPS websites
HTTP (Hypertext Transfer Protocol) is the basic protocol used by web browsers
and web servers. Unfortunately, it is not encrypted and so can cause internet
traffic to be intercepted, read and understood. Hackers could intercept any private
information including bank details and then use these to commit fraud. HTTPS
(Hypertext Transfer Protocol Secure), however, enables users to browse the world
wide web securely. To do this, it uses the HTTP protocol but with SSL/TLS
encryption overlaid. HTTPS websites have a digital certificate issued by a trusted
CA, which means that users know the website is who it says it is; as we mentioned
in the section on SSL/TLS, it proves it is authentic. It again ensures the integrity of
the data by showing that web pages have not been changed by a hacker while being
transferred and by encrypting any information transferred from the client to the
server, and vice versa. As a result of the use of HTTPS, users can securely transmit
confidential information such as credit card numbers, social security numbers and
login credentials over the internet. If they used an ordinary HTTP website, data is
sent as plain text, which could easily be intercepted by a hacker or fraudster.
Indicators that you are using a secure site are the https:// prefix at the
start of the URL and a padlock icon next to the
URL. Depending on your browser and the type of certificate the website has
installed, the padlock may be green. The way HTTPS works is that the web browser
on the client computer performs a handshake with the web server, as described
earlier in the section on SSL/TLS. Then, what is sometimes called a session key is
created randomly by the web browser and is encrypted using the public key, then
sent to the server. The server then decrypts the session key using its private key.
All data sent between the two from then on is encrypted using this session key.
So, the generation of the session key is done through asymmetric encryption, but
symmetric encryption is used to encrypt all further communications. Asymmetric
encryption requires a lot of processing power and time, so HTTPS uses a
combination of asymmetric and symmetric encryption. Once the session is finished,
the client and the server each discard the symmetric key used for that session, so each
separate session requires a new session key to be created.
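The sketch below is not the real TLS handshake, but it illustrates the hybrid idea just described, using the same cryptography package as before: a random session key is protected with asymmetric encryption, and the rest of the traffic is encrypted symmetrically with it.

from cryptography.fernet import Fernet
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

# Server: an RSA key pair; the public key would be published in its certificate.
server_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# Client: create a random session key and send it encrypted with the public key.
session_key = Fernet.generate_key()
wrapped_key = server_private.public_key().encrypt(session_key, oaep)

# Server: recover the session key; both sides now use fast symmetric encryption.
recovered_key = server_private.decrypt(wrapped_key, oaep)
message = Fernet(session_key).encrypt(b"example request containing card details")
print(Fernet(recovered_key).decrypt(message))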
One of the benefits of HTTPS, obviously, is security of data, with information
remaining confidential because only the client browser and the server can
decrypt it. Another benefit of HTTPS is that search engine results tend to rank
HTTPS sites higher than HTTP sites. However, the time required to load an
HTTPS website tends to be greater. Websites have to ensure that their SSL
certificate has not expired and this creates extra work for the host as it has to
keep on top of certificate management.
1.3.5 Advantages and disadvantages of different protocols
and methods of encryption
There are a number of advantages of using encryption. As was said earlier, if
personal information is sent across the internet, whether it is credit card information
or personal details, once it is encrypted the information can no longer be understood
or used for purposes such as identity theft or cyber-fraud, or held to ransom. Company
secrets would not be able to be sold by hackers to rival companies. One drawback
with encryption, however, is that it takes more time to load encrypted data as well
as requiring additional processing power. When browsing, the client and server
must send messages to each other several times before any data is transmitted. This
increases the time it takes to load a webpage by several milliseconds. It also uses up
valuable memory for both the client and the server. Encryption involves the use of
keys and while a larger key size means more effective encryption, it also increases
the computational power required to perform the encryption. Encryption is meant
to protect data, but it can also be the source of great inconvenience to a computer
user.
Ransomware can be used against individual computer users; hackers can
encrypt computers and servers and then demand a ransom. If the ransom is paid,
they provide a key to decrypt the encrypted data. Another problem with encryption
is that if the private key is lost, it is extremely difficult to recover the data and, in
certain circumstances, the data may well be lost permanently. It is possible for the
data to be recovered by the reissuing of the digital certificate, but this can take time
and, in the meantime, if a hacker has managed to get hold of the key, they will
have full access to the encrypted data. In addition, users can get careless and forget
that decrypted data should not be left in the decrypted state for too long, as it then
becomes susceptible to further attack from hackers.
Regarding the different forms of encryption, asymmetric is much slower compared
to symmetric due to its mathematical complexity, and it is therefore not suitable for
encrypting vast amounts of data. It also requires greater computational power.
While it is difficult to give the advantages and disadvantages of individual
protocols since they all do a similar job, it is possible to compare the advantages
and disadvantages of SSL/TLS with IPsec for setting up a VPN, for example.
With SSL/TLS, digital certificates are only essential with the server (client
digital certificates are optional), whereas with IPsec both client and server have
to be authenticated, which makes it more difficult to manage an IPsec system.
However, as user authentication is optional, this means that security is weakened
with SSL/TLS compared to IPsec.
Extra software has to be downloaded when using SSL/TLS if non-web-based
applications are used, which may be a problem if a
firewall prevents or slows
down access to these downloads. VPN tunnels using SSL/TLS are not supported
by certain operating systems which do support IPsec. In conclusion, the time-
consuming management of digital certificates is less of a problem with SSL/TLS
than with IPsec, which could lead to a saving of money. Another cost implication
is that, unlike most uses of IPsec, you do not need to buy client software and the
process of setting up and managing such a system tends to be easier.
Activity 1d
1 Briefly describe what is meant by symmetric encryption.
2 Write a sentence about each of the three uses of encryption.
3 Explain the difference between HTTP and HTTPS.
1.4 Checking the accuracy of data
1.4.1 Validation and verification
It is vital that data is input accurately to ensure that the processing of that data
produces accurate results. When using a computer system, data entry is probably
the most time-consuming part of data processing. Consequently, it is important
to try and ensure that the number of errors which occur when entering data
directly or transferring from another medium is very small, otherwise more
time will need to be spent correcting the data or even re-entering it. To try
and ensure that data entry is accurate, we use two methods called validation
and verification. Neither method ensures that the data being entered is correct
or is the value that the user intended to enter. Verification tries to ensure that
the actual process of data entry is accurate, whereas validation ensures the
data values entered are reasonable. It may well be that the original data when
collected was incorrect; here we are just ensuring that no further errors occur
during the transfer process.
Methods of validation
As stated above, data validation simply ensures that data is reasonable and sensible.
In the early days of computerisation there were some horror stories of people
getting utility bills for 1 million dollars! This was usually the result of poor
checking being carried out and nobody noticing that the decimal point had been
put in the wrong place. A validation check that ensured nobody could have a bill
of more than $1000 would have stopped this. However, it would not prevent
somebody getting an incorrect bill of $321 instead of $231, as the amount being
charged would still be regarded as sensible or reasonable. To emphasise then,
validation ensures data is reasonable or sensible but not necessarily correct.
In order to ensure that data input to a system is valid, it is essential to
incorporate as many validation routines as possible. The number and methods of
validation routines or checks will obviously depend on the form or type of input
to the system. Not every
field can have a validation check. For example, there
are so many variations of an individual’s name that this would be very difficult
to validate. Some have letters that might not be recognised by a computer using
an alphabet based on the English language and some contain punctuation
marks such as apostrophes and hyphens.
In order to illustrate the types of validation, let us consider a school library
database which has one table for books and another table for upper-school
students/borrowers. Let us look at an extract showing some of the typical
books and borrowers from the complete database. We shall assume that these
records are representative of the whole database.
ISBN Title Author Published Cost
9781474606189 The Labyrinth of the Spirits Carlos Ruiz Zafón 2016 $10
9780751572858 Lethal White Robert Galbraith 2018 $18
9781780899329 18th Abduction James Patterson 2019 $29
9781408711095 The Colours of All the Cattle Alexander McCall Smith 1997 $26
Table 1.2 Books database table
Borrower_ID Name Date_of_birth Class Book_borrowed
0205 Chew Ming 21/12/04 11D 9781474606189
1016 Gurvinder Sidhu 19/11/05 10A 9781408711095
0628 Gary Goh 18/04/06 10C 9781474606189
1014 Jasmine Reeves 13/02/05 11A 9780751572858
Table 1.3 Years 10 and 11 upper-school borrowers database table
Presence check
When important data has to be included in a database and must not be left out,
we use a presence check to make sure data has been entered in certain fields.
A common mistake made by many students when asked which validation check
should be used on a field in a database, is to always say ‘presence check’. It
is the easiest to remember but is rarely used on any field except the key field.
With many of the fields illustrated above, it is not necessary to use a presence
check. The data can be updated or entered at a later date. We would, however,
probably use a presence check on the ISBN field in the Books table and on the
Borrower_ID field in the Borrowers table. These are key fields and, if their data
were missing, it would be very difficult to identify unique records.
You may well have come across this when completing online data-
capture forms. Fields sometimes have a red asterisk next to them, along with a
message instructing users that fields marked with an asterisk must be completed.
Presence checks are frequently used to check that you have entered data into
these fields and, if you have not, then there is usually a warning message saying
you must enter the data before you can move on to the next page. While the
presence check prevents you from missing out a field or record, it is otherwise a
fairly inefficient method, as it does not prevent you from entering incorrect or
unreasonable data.
Range check
Fields which contain numeric data tend to have
range checks designed for them.
Using the extract from the Books table above, we can see that the maximum
cost of a book is $29 and the minimum cost is $10. We can make sure that the
person entering the data does not enter a cost less than $10 or more than $29
by using a range check. A range check always has an upper value and a lower
value which form the range of acceptable values. If a value is entered that is
outside this range, an error message is produced. We will deal with the limit
validation check later in this section. Suffice to say at this point that a
limit
check
only has one boundary, which is either upper or lower.
So, to sum up, a range check checks that the data is within a specified range of
values. In our example, the cost of a book should lie between 10 and 29. A range
check could be carried out on each part of the Date_of_birth field in the Borrowers
table to ensure that 32 or 0 was not entered in the day part or that 13 or 0 was not
entered for the month part. However, this would not prevent impossible dates such
as 31/2/05 being entered (February can only have day values of 29 or less).
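To illustrate the idea, here is a minimal sketch in Python of how a range check on the Cost field might be written. It is only an illustration and not part of any particular database package; the boundary values 10 and 29 simply come from our sample Books table.

def range_check(value, minimum=10, maximum=29):
    # Accept the value only if it lies within the inclusive range
    return minimum <= value <= maximum

print(range_check(18))  # True  - a sensible cost, so it is accepted
print(range_check(35))  # False - outside the range, so an error message would be produced

A limit check, which we meet later in this section, would simply drop one of the two boundaries.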
Type check
A
type check ensures that data is of a particular data type. In the above example,
Borrower_ID in the Borrowers table could be set up so that a validation rule would
ensure that every character was
numeric. Most students think that you only have
to set the field data type to numeric to prevent text from being entered. While
this is true, it would not be sufficient with the Borrower_ID field, as any leading
zeros, such as with 0205, would be removed in such a field. This would also not be
acceptable in most parts of the world in fields containing telephone numbers. The
data type of Borrower_ID would have to be set to
alphanumeric and a validation
routine would need to be created to only allow the digits 0 to 9 to be entered.
Fields in some databases would contain letters of the alphabet only. Setting the field
type to alphanumeric does not prevent numbers from being entered. A separate
validation routine would need to be set up. A type check can be performed on
most fields to make sure that no invalid characters are entered. For that reason, it is
often referred to as an invalid character check, or by its shortened name, character
check. One of the shortcomings of this check is that it would not alert you if you
had not typed in the correct number of characters in a particular field.
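As a rough illustration only, a character check on the Borrower_ID field could be written along the following lines in Python, assuming the field is stored as text so that leading zeros are kept:

def character_check(borrower_id):
    # Accept the ID only if every character is one of the digits 0 to 9
    return borrower_id.isdigit()

print(character_check("0205"))  # True  - all digits, and the leading zero is preserved
print(character_check("02A5"))  # False - contains a letter, so an error message would be produced

Note that, exactly as described above, this check says nothing about whether the correct number of characters has been typed in.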
Length check
A
length check is performed on alphanumeric fields to ensure they have the
correct number of characters, but it is not used with numeric fields. Generally,
it is carried out on fields that have a fixed number of characters, for example the
length of a telephone number in France tends to be 10 digits. As the leading
digit is 0, the phone numbers would be stored in the alphanumeric data type, so
it is fairly straightforward to apply a length check.
Again, the application of this type of validation check is not to be confused with
setting up the data structure of a file so that the field length is a fixed number
of characters, as this does not prevent a user from typing in a string of less than
that fixed number of characters. In this instance no error message would be
produced, but with a length check, if you did not type in a correct number of
characters, an error message would result.
In our database, we would apply a length check to the ISBN field in the Books
table. All ISBNs in our table are 13 characters long, so we need to have a check
to make sure we are not typing in fewer or more than 13 characters. We could
also apply this type of check to the Borrower_ID field in the Borrowers table
as these all appear to be four characters in length. This would not give us full
validation on the field, however, as letters of the alphabet, if entered, would not
be flagged up as an error, as the check only counts the number of characters.
It is also worth noting that length checks can be set to a range of lengths. For
example, phone numbers in Ireland tend to vary between 8 and 10 characters in
length. In that instance, you would need a routine that would not allow you to
type in fewer than eight characters or more than ten characters. This is not to be
confused with the range check described above.
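A length check is equally easy to sketch. The Python routine below is only an illustration; the example telephone number is invented, and the allowed lengths are passed in as parameters so that either a fixed length (the ISBN) or a range of lengths (the Irish phone numbers) can be checked.

def length_check(value, minimum, maximum):
    # Accept the value only if its number of characters lies between the two limits
    return minimum <= len(value) <= maximum

print(length_check("9781474606189", 13, 13))  # True  - exactly 13 characters
print(length_check("978147460618", 13, 13))   # False - only 12 characters were typed
print(length_check("012345678", 8, 10))       # True  - an invented 9-character phone number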
Format check
New vehicle registration or licence plates in the UK follow an identical pattern.
The format is two letters followed by two digits followed by a space and then
three letters, for example, XX21 YYY. Any new registration number when
entered into a database can therefore be validated using a
format check or
picture check. If a combination of characters is entered which does not follow
this pattern, it should produce an error message.
In a format check, you can specify any specific combination of characters that
must be followed. In our example database, we could use a format or picture
check on the Class field in the Borrowers table. We could set it so that it must
be two digits followed by one letter. If somebody did not realise that it was the
Upper-school borrowers table and typed in 9B, for example, this would cause
an error message to be output saying that the field must have two numbers
followed by a letter. While a format check is very useful in this scenario, it would
not prevent somebody mistyping an entry, and entering 19A for example, by
mistake, as this would be accepted by the system. This form of validation is very
useful for checking dates if they are in a specific format, such as Date_of_birth
in the Borrowers table, which consists of two digits followed by a slash, followed
by two digits followed by a slash, followed by two digits.
Many students mistakenly think that you can put a format check on a
currency
field. This is not the case. The data is stored as a number and the currency symbol is
added by the software. The user is only entering numbers, not the currency symbol.
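A format or picture check is usually implemented with a pattern of some kind. The sketch below uses Python regular expressions purely as an illustration; the two patterns describe the UK registration plate format and the Class field from our Borrowers table.

import re

PLATE_PATTERN = re.compile(r"[A-Z]{2}[0-9]{2} [A-Z]{3}")  # e.g. XX21 YYY
CLASS_PATTERN = re.compile(r"[0-9]{2}[A-Z]")              # e.g. 11D

def format_check(value, pattern):
    # Accept the value only if the whole string matches the required picture
    return pattern.fullmatch(value) is not None

print(format_check("XX21 YYY", PLATE_PATTERN))  # True
print(format_check("9B", CLASS_PATTERN))        # False - only one digit before the letter
print(format_check("19A", CLASS_PATTERN))       # True  - right format, even though class 19A does not exist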
Check digit
For our purposes we will consider the use of a
check digit as a means of
validating data as it is being input (some people would argue that it can be
used as a verification check in certain other circumstances, but that is outside
the scope of this book). It is used on numerical data, which is often stored as a
string of alphanumeric data-type. For example, the last digit of an ISBN for a
book is a check digit calculated using simple arithmetic. There are a number of
ways of calculating the check digit but one of the most frequently used methods
is described here. Using the first 12 digits in the ISBN, each individual digit
in the string is multiplied by 1 if it is in an odd-numbered position (such as
1st, 3rd and so on) or 3 if it is in an even-numbered position (such as 2nd, 4th
and so on). The resulting numbers are added together and divided by 10. If
the remainder is 0 that becomes the check digit, otherwise the remainder is
subtracted from 10 and that becomes the check digit. This is then added to the
end of the string, in this case as the 13th digit of the ISBN.
This happens at a stage before the data is entered, for example when the ISBN is
allocated before a book is published. When this data comes to be entered into a
database in a library, the computer recalculates the check digit to check whether it
gives the same check digit. If it does not, then an error message is produced. This
usually happens when the person entering the data has transposed two digits. In
our example if we typed in 9781447606189 instead of 9781474606189, the check
digit would be recalculated from the first 12 digits and would produce the check
digit 5 instead of 9. This would produce an error message.
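The check digit calculation described above can be written as a short routine. The following Python sketch is one way of doing it and reproduces the two results just mentioned:

def isbn13_check_digit(first_12_digits):
    # Weight odd positions (1st, 3rd, ...) by 1 and even positions (2nd, 4th, ...) by 3
    total = sum(int(digit) * (1 if position % 2 == 0 else 3)
                for position, digit in enumerate(first_12_digits))
    remainder = total % 10
    return 0 if remainder == 0 else 10 - remainder

def isbn13_is_valid(isbn):
    # Recalculate the check digit and compare it with the 13th digit entered
    return isbn13_check_digit(isbn[:12]) == int(isbn[12])

print(isbn13_is_valid("9781474606189"))  # True  - recalculated check digit is 9
print(isbn13_is_valid("9781447606189"))  # False - transposed digits give a check digit of 5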
Lookup check
A lookup check compares the data that has been entered with a limited number
of valid entries. If it matches one of these then it is allowed, but if it does not then
an error message is produced. It can only be used efficiently if there are a limited
number of lookup values, such as the days of the week, where there are only seven.
A lookup check is not to be confused with setting up a lookup list for the user to
select data items from when entering data. A lookup list does not produce error
messages and in certain circumstances users can still enter data which overwrites
the given list. In our database, we could use the Class field in the Borrowers table,
which would probably only have the classes 10A, 10B, 10C, 10D, 11A, 11B, 11C,
11D to choose from. If, using the example above, somebody entered 9B, the
computer would compare this to the values stored in a separate table and would
not be able to find a match and would produce an error message.
Consistency check
This is sometimes called an integrity check, but for our purposes, so that we
do not get confused with referential integrity (which we will meet later in the
book), we will refer to it as a
consistency check. It checks that data across
two fields is consistent. A good example would be to ensure data consistency
between the Class field in the Borrowers table and the Date_of_birth field. We
will assume that each student is allocated to a class according to their age when
they join the school. We will assume that, in our example, students in year 11
were born between 1 September 2004 and 31 August 2005. A consistency
check can be applied here so that if the first two digits of the class are 11 then
the student's Date_of_birth must be between 01/09/04 and 31/08/05. If this
is not the case, then this validation check would output an error message. It is
often applied so that a field that contains a person's age must be consistent with
a field that contains their date of birth, though it should be stressed that storing
the age of a person is considered to be bad practice as it changes regularly and
needs updating often.
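A consistency check compares two fields in the same record. The Python sketch below, again only an illustration, checks the Year 11 rule described above; other year groups would need their own date ranges.

from datetime import date

def consistency_check(class_code, date_of_birth):
    # For a Year 11 class the date of birth must fall in the stated range
    if class_code.startswith("11"):
        return date(2004, 9, 1) <= date_of_birth <= date(2005, 8, 31)
    return True  # other year groups are not checked in this simple sketch

print(consistency_check("11D", date(2004, 12, 21)))  # True  - consistent with Year 11
print(consistency_check("11A", date(2006, 4, 18)))   # False - born too late for Year 11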
Limit check
We have already considered the range check; now we will look at a limit check.
A limit check is similar to a range check, but the check is only applied to one
boundary. For example, in the UK you are only allowed to drive from the age
of 17, but there is no upper limit. If somebody enters a number lower than 17
when asked to enter their age when applying for a driving licence, for example,
this will generate an error message. In our database it is difficult to apply a limit
check given the data provided.
The need for both validation and verification
The two methods of checking the accuracy of data are complementary. We have
seen that verification can report on errors that validation cannot and, similarly,
validation will pick up errors that verification cannot. Both are needed to ensure
that data is sensible and reasonable in the first place and also transferred accurately.
The difference between validation and verification
As we saw earlier both these methods are essential to ensuring the entry of data is
accurate. It is important to emphasise that we are checking that the data was entered
accurately; we are not checking that the data itself is accurate. There are a number
of differences between verification and validation. Validation is always carried
out by a computer whereas verification can be carried out by a computer or by a
human. Validation is checking that the data is reasonable and sensible. Verification
is checking that the data has been copied or entered correctly but cannot tell
you whether the data is sensible or not. Similarly, validation does not help if you
have copied the data incorrectly. If you type in FD236CS instead of DF236CS, a
format check would accept this as valid input (as it is still two letters followed by
three
numbers followed by two letters) even though it has been copied incorrectly.
Verification would alert the user to this error. Data may have been invalid when
collected but verification only helps you to know that the data has been transferred
accurately to another medium. It does not help if the original data is incorrect.
Consider an electricity company which employs meter readers to read customers’
electricity meters. Suppose the meter reader accidentally writes down that, for
one customer, the number of units used was 4866 instead of the actual reading,
which was 4860. When the readings for all the customers have been collected,
they are entered into the computer, including the incorrect reading of 4866.
At this stage all verification would do is check that the number entered was the
same as that in the source document, 4866, so incorrect data would pass the
verification test. The validation check might be that readings must be between
2000 and 6000. Again, incorrect data would pass this test as well. This shows
how important it is that the correct data is collected in the first place, since
verification and validation might still allow the data to pass through the system
undetected. Verification is a way of ensuring that the user does not make a
mistake when inputting data whereas validation is checking that the data input
conforms with what the system considers to be sensible and reasonable.
Verification
As already stated, verification simply ensures that data has either been entered
accurately by a human or that it has been transferred accurately from one storage
medium to another. There are a number of methods of verification, some related
to manual entry of data and some related to data transfer.
Visual checking
Visual checking is carried out by the person who enters the data, who visually
compares the data they have entered with that on the source document. They
can see the differences and then correct the mistakes. This is the simplest form of
verification and can be done by reading the data on the screen to make sure it is the
same as the source document. An alternative method is to print out the data entered
and compare the printout side by side with the source document. Visual checking
can be rather time-consuming and possibly costly as a result. Another problem is
that the person who is checking that the data has been entered correctly may be the
same person who entered it. It is very easy for them to overlook their own mistakes.
A possible way around this is to get somebody else to do the check.
Double data entry
Double data entry, as the name suggests, involves the entry of data twice. The
first version is stored. The second entry is compared to the first by a computer,
and the person entering the data is alerted by the computer to any differences.
The user then checks to see if the second attempt is correct, corrects the error if
necessary, and continues entering the data.
The alternative to this way of entering data twice is for two different people
to enter the data, which is temporarily saved to the same hard disk. The
computer compares the two versions on the disk and alerts both operators to
any differences, which are then checked to see which version is correct. Some
systems cause the keyboards to freeze so that the people entering the data
cannot continue until the mistake is corrected.
Double data entry is similar to visual verification in that both methods compare
two versions of data and check that data is copied accurately, not checking
that the data collected in the first place was accurate or correct. The essential
difference is that visual verification is carried out by the user, whereas with
double data entry it is the computer that compares the two versions.
Parity check
As was mentioned in Section 1.1.1 the computer stores data in the form of
bits. Each string of bits is called a byte, with a byte normally consisting of
8 bits. Each bit is either 1 or 0, for example 10001101. A byte represents a
number between 0 and 255. Most computers use the American Standard Code
for Information Interchange (ASCII). It is a code which uses numbers
to represent 96 English-language characters, with each character being given
a number between 32 and 127. The first 32 codes (0–31) in ASCII are
unprintable control codes and are used to control peripherals such as printers.
For example, 0 represents the null character, 8 is equivalent to the backspace key
on the keyboard and 13 is equivalent to the return key. The codes 32–127
represent letters of the alphabet, numbers, and other symbols such as $, %, &.
So, for example, the ASCII code for uppercase I is 73, which is 01001001 in binary.
If we wanted to represent the word BROWN, we would use the ASCII codes
66, 82, 79, 87 and 78, which would in turn be represented by the following
bytes: 0100001001010010010011110101011101001110
In the early days of computers, only 7 bits of each byte were used to hold the
information, with the extra bit used for a parity check. It soon became apparent
that extra characters needed to be represented in the system, such as the Spanish ñ,
the French è and the symbol ©. These form what is called extended ASCII and
are represented by the codes 128–255, providing 128 additional characters.
With all 8 bits of the byte now used for data, the parity bit is added as a ninth bit.
It is still possible to use just seven bits of the byte to represent data and have the
eighth as a parity bit, but this would give us a limited set of characters to work
with. In this section, we will only be considering 9-bit parity checking.
Most computers use ASCII codes to represent text, which makes it possible to
transfer data from one computer to another. There has to be, however, a way to
check that data has been transmitted accurately. This is called parity checking,
which involves the use of parity bits. The parity bit is added to every byte (8
bits) that is transmitted. The parity bit is added to the end of the byte so that
there are an even number of 1s. Some systems use an odd number of 1s but we
will be looking at the most commonly used, which is even parity.
When data is being transmitted from one device to another, the sending device
counts the number of 1s in each byte. If the number of 1s is even, it sets the
parity bit to 0 and adds this on to the end of the byte. However, if the number
of 1s is odd, it sets the parity bit to 1 and adds it on. The result is that every
byte of transmitted data consists of an even number of 1s. When the other
device receives the data, it checks each byte to make sure that it has an even
number of 1s. If there is an odd number of 1s, then this means there has been
an error during the transfer of data. So how does this all work?
Consider the example given above: the word BROWN
B – 01000010 there are two 1s (even) so we add 0; it now becomes 010000100
(two 1s [even])
R – 01010010 there are three 1s (odd) so we add 1; it now becomes 010100101
(four 1s [even])
O – 01001111 there are five 1s (odd) so we add 1; it now becomes 010011111
(six 1s [even])
W – 01010111 there are five 1s (odd) so we add 1; it now becomes 010101111
(six 1s [even])
N – 01001110 there are four 1s (even) so we add 0; it now becomes 010011100
(four 1s [even])
This is a very effective verification method. It makes sure that all bytes have
an even number of 1s. If a 1 within the byte is transmitted as a 0, then the
error will be trapped by the system. However, there are still errors which can
go undetected. If two 1s within the byte get transmitted as 0s, the byte will
still have an even number of 1s and so the system will not report an error. For
example, if 010101111 (W with parity bit added) is transmitted as 010001101
(F with parity bit added) the parity check will not notice this, as there is still
an even number of 1s. Also, if, somehow, a 1 and a 0 are transposed, such as
010000100 (B with a parity bit added) being transmitted as 010000010 (A with
a parity bit added), again the parity check will not report an error as there is still
an even number of 1s. More complex error-checking methods have had to be
developed, but parity checking is still very common because it is such a simple
method for detecting errors.
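The whole process can be sketched in a few lines of Python. This is purely an illustration of even parity on 8-bit bytes written out as strings of 0s and 1s:

def add_even_parity(byte_string):
    # Append 0 if the count of 1s is already even, otherwise append 1
    parity_bit = "0" if byte_string.count("1") % 2 == 0 else "1"
    return byte_string + parity_bit

def parity_is_valid(nine_bits):
    # The receiving device simply checks that the count of 1s is still even
    return nine_bits.count("1") % 2 == 0

for letter, byte in [("B", "01000010"), ("R", "01010010"), ("O", "01001111"),
                     ("W", "01010111"), ("N", "01001110")]:
    print(letter, add_even_parity(byte))

print(parity_is_valid("010101110"))  # False - one bit of W was corrupted, so the error is detected
print(parity_is_valid("010001101"))  # True  - two bits changed, so the error slips through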
Checksum
Checksums are a follow-on from the use of parity checks in that they are used
to check that data has been transmitted accurately from one device to another.
A checksum is used for whole files of data, as opposed to a parity check which
is performed byte by byte. They are used when data is transmitted, whether
it be from one computer to another in a network, or across the internet, in an
attempt to ensure that the file which has been received is exactly the same as the
file which was sent.
A checksum can be calculated in many different ways, using different
algorithms, for example a simple checksum could simply be the number of bytes
in a file. Just as we saw with the problem with transposition of bits deceiving a
parity check, this type of checksum would not be able to notice if two or more
bytes were swapped; the data would be different, but the checksum would
be the same. Sometimes, encryption algorithms are used to verify data; the
checksum is calculated using an algorithm called a hash function (not to be
confused with a
hash total, which we will be looking at next) and is transmitted
at the end of the file. The receiving device recalculates the checksum, and then
compares it to the one it received, to make sure they are identical.
Two common checksum algorithms are MD5 and SHA-1, but both have been
found to have weaknesses: it is possible for two different files to have the same
calculated checksum. Because of this, the newer SHA-2 and SHA-3 families have
been developed, which are much more reliable.
The actual checksum is produced in hexadecimal format. This is a counting
system that is based on the number 16, whereas we typically count numbers
based on 10. You can see what each hexadecimal value represents in this table.
Hexadecimal: 0 1 2 3 4 5 6 7 8 9 A B C D E F
Base 10: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Table 1.4 Hexadecimal values
MD5 checksums consist of 32 hexadecimal characters, such as
591a23eacc5d55a528e22ec7b99705cc. These are added to the end of the file.
After the file is transmitted, the checksum is recalculated by the receiving device
and compared with the original checksum. If the checksum is different, then the
file has probably been corrupted during transmission and must be sent again.
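In practice, checksums of this kind are rarely calculated by hand. The Python sketch below shows the general idea using the standard hashlib library; the file name orders.dat is just a placeholder, and the example calls are commented out so that the sketch does not depend on any particular file existing.

import hashlib

def file_checksum(path, algorithm="md5"):
    # Read the whole file in blocks and return its checksum as hexadecimal text
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            hasher.update(block)
    return hasher.hexdigest()

# The sender and the receiver both calculate the checksum of the file;
# if the two values differ, the file must be sent again.
# print(file_checksum("orders.dat"))            # 32 hexadecimal characters for MD5
# print(file_checksum("orders.dat", "sha256"))  # a longer, more reliable checksum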
Hash total
This is similar to the previous two methods in that a calculation is performed
using the data before it is sent, then it is recalculated, and if the data has
transmitted successfully with no errors, the result of the calculation will be the
same. However, this time the calculation takes a different form; a hash total is
usually found by adding up all the numbers in a specific field or fields in a file.
It is usually performed on data not normally used in calculations, such as an
employee code number. After the data is transmitted, the hash total is recalculated
and compared with the original value. If it has not been transmitted properly or
data has been lost or corrupted, the totals will be different. Data will have to be
sent again or the data will have to be visually checked to detect the error.
This type of check is normally performed on large files but, for demonstration
purposes, we will just consider a simple example. Sometimes, school
examinations secretaries are asked to do a statistical analysis of exam results.
Here we have a small extract from the data that might have been collected.
Student ID Number of exam passes
4762 6
0153 8
2539 7
4651 3
Table 1.5 Sample data
Normally, the Student ID would be stored as an alphanumeric type, so for the
purpose of a hash total, it would be converted to a number. The hash total check involves
adding all the Student IDs together. In this example it would perform the calculation
4762 + 153 + 2539 + 4651 giving us a hash total of 12105. The data would be
transmitted along with the hash total and then the hash total would be recalculated
and compared with the original to make sure it was the same and that the data had
been transmitted correctly. We would use a hash total here because there is no other
point to adding the Student IDs together. Apart from verification purposes, the
hash total produced is meaningless and is not used for any other purpose.
Control total
A
control total is calculated in exactly the same way as a hash total, but is only
carried out on numeric fields. There is no need to convert alphanumeric data to
numeric. The value produced is a meaningful one which has a use. In our example
above, we can see that it would be useful for the head teacher to know what the
average pass rate was each year. The control total can be used to calculate this
average by dividing it by the number of students. The calculation is 6 + 8 + 7 + 3
giving us a control total of 24. If that is divided by 4, the number of students, we
find that the average number of passes per student is 6. Obviously, the control total
check is usually carried out on much larger volumes of data than our small extract.
The use of a control total is the same as for a hash total in that the control total is
added to the file, the file is transmitted and the control total is recalculated. Just
as with the hash total, if the values are different, it is an indication that the data
has not been transmitted or entered correctly. However, both types of check do
have their shortcomings. If two numbers were transposed, say student 4762 was
entered as having 8 passes and 0153 with 6 passes, this would obviously be an
error but would not be picked up by either a control or hash total check.
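Both totals can be demonstrated with the extract in Table 1.5. The short Python sketch below simply reproduces the arithmetic described above:

records = [("4762", 6), ("0153", 8), ("2539", 7), ("4651", 3)]

# Hash total: the Student IDs are converted to numbers and added; the total is otherwise meaningless
hash_total = sum(int(student_id) for student_id, passes in records)

# Control total: the sum of a genuinely numeric field, which also has a real use
control_total = sum(passes for student_id, passes in records)

print(hash_total)                    # 12105
print(control_total)                 # 24
print(control_total / len(records))  # 6.0 - the average number of passes per student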
Activity 1e
1 Briefly describe what is meant by verification.
2 Write a brief description for each of three different validation checks.
1.5 Data processing
As we saw in Section 1.1, data must be processed so that it can become
information. Data can include personal data, transaction data, sensor data and
much more. Data processing is when data is collected and translated into usable
information. Data processing starts with data in its raw form and translates
it into a more readable format such as diagrams, graphs, and reports. The
processing is required to give data the structure and context necessary so it can
be understood by other computers and then used by employees throughout an
organisation. There are several different methods of data processing, but the
three most popular ones are batch, online, and real-time. We will now consider
each of these in turn.
1.5.1 Batch processing
In business, a transaction occurs when someone buys or sells something, but
in IT the term ‘
transaction’ can mean much more, such as adding, deleting or
changing values in a database.
Batch processing is still used today as it is an effective method of processing
large volumes of data; sometimes millions of transactions are collected over a
period of time. The data is entered and processed all together in one batch by
the computer, which then produces the required results. The processing of
one batch is often called a job. Batch processing allows computers to process
data when computing resources are not being fully utilised, such as overnight,
and requires very little, and often no, human interaction. Examples of batch
processing include payrolls and customer billing systems.
Batch processing provides a number of benefits. One of these is that it allows
a business to process jobs at the times of day when computer resources are
not being used fully, thereby saving the cost of the computer doing very little.
Companies are able to make sure that vital tasks which need immediate attention,
such as online services, can use the computer resources as and when needed.
They can then timetable batch-processing jobs for those times of day when online
processing tasks are fewer. Compared to
real-time processing, batch processing
requires a simpler computer system without complex hardware or software. Once
the system is up and running, it does not need as much maintenance as a real-time
system and, because there is less human interaction, data entry methods used in
batch processing tend to be more accurate. The actual processing, unlike the data
collection, is very fast with batch processing, with several jobs being processed
at the same time. Many utility companies such as water, gas and electricity
companies handle huge amounts of data in their billing systems. The batch
processing of the collected data, and the calculation and printing of the bills, can
be done mainly at night, so that the computers can be used to help control the
actual delivery of the utility during the busy part of the day.
Master and transaction files
In the next section, we will look at how batch processing is involved in
producing a company payroll. First, we need to be sure we understand that
two separate files are involved in this process. Let us consider the payroll
system of a company that pays its workers weekly. One file is called the
master
file
, which contains all the important data that does not change often, such
as name, works number, department, hourly rate, and so on. The other file,
called the transaction file, contains the data that changes each week, such as
hours worked. The master file would already be sorted in order of the key field
so that processing is easier to perform. The transaction file would probably be
in the order that the transactions were collected. Before these two files can
be processed together to produce the payroll, the transaction file needs to be
sorted in the same order as the master file and will also need to be checked for
errors using validation checks.
A master file would be used together with a transaction file in most batch
processing applications, including a computerised customer orders system (which
we will also be looking at shortly). When data is stored in order of a key field,
it often forms what is called a ‘sequential file’ (the data is in a predetermined
sequence). These used to be stored on magnetic tapes, but magnetic disks and
solid state drives are now used much more than magnetic tapes, which tend, in
modern systems, only to be used for backing up systems. The transaction file will
therefore be stored on disk, even though it holds data in sequential order. When
data is searched for in batch processing, each record is looked at one by one until
the computer finds the record it is looking for. This is called
sequential access.
The steps involved in updating a master file using a transaction file
There are occasions when the data in a master file has to be updated, for example
so that a new worker will get paid. We will assume that the changes will happen
on a weekly basis. It is likely that the transaction file would contain any needed
updates as well as the payroll data and there would only be one transaction file,
but to make it simpler to understand we will assume that the updating of the
master file happens first and then the payroll runs immediately afterwards.
The three types of transaction involved in updating a master file in the scenario
we have outlined are when:
» a worker moves to a different department: their record must be amended or
changed
» a worker leaves the company: their record needs to be removed or deleted
» a new worker starts with the company: their record needs to be added.
We can give each type of update a letter. Changing an existing record in the
master file will be C, whereas deleting a record from the master file will be D
and adding a new record to the master file will be A. At the end of each week,
the computer system will process the data stored in the transaction file and
make any changes that are necessary to the master file, thereby producing an
updated master file.
We can demonstrate this, using the following small sample of data.
ID Transaction Employee name Department
2 D Julia Bolero Sales
4 C Nigel Ndlovu Buying
7 D Adrienne Pascal IT
11 A John Ward Stores
12 A Paolo Miserere IT
EOF End of file marker
Table 1.6 Transaction file
In order to update the master file, a new blank file will
be created and act as the new master file. The following
very basic algorithm will be followed.
Advice
This algorithm and the subsequent algorithms in this
chapter are simplified versions of what an algorithm might
look like. More efficient ways of writing these and other
algorithms will be covered in Chapter 4.
We will look at REPEAT…UNTIL in more detail in Chapter 4.
1 First record in the transaction file is read
2 First record in the old master file is read
3 REPEAT
4 IDs are compared
5 IF IDs do not match, old master file record is
written to new master file
6 IF IDs match transaction is carried out
7 IF transaction is D, old master file record
is not written to new master file
8 IF transaction is C, data in transaction file
is written to new master file
9 IF IDs match, next record from transaction file
is read
10 Next record from master file is read
11 UNTIL end of old master file
12 Data in transaction file record is written to new
master file
13 Any remaining records of the transaction file are
written to the master file
ID Employee name Department
1 Jose Fernandez Buying
2 Julia Bolero Sales
3 Louis Cordoba Sales
4 Nigel Ndlovu Stores
5 Bertrand Couture Buying
6 Lionel Sucio Stores
7 Adrienne Pascal IT
8 Gurjit Mandare Stores
9 Iqbal Sadiq IT
10 Tyler Lewis Buying
EOF End of file marker
Table 1.7 Master file
What is happening in this algorithm is that the computer reads the first record in the
transaction file and the first record in the old master file. If the ID does not match,
as it does not in this case (ID is 2 in the transaction file but the ID in the master file
is 1), there is no change necessary. The computer simply writes the old master file
record to the new master file and misses out the next instructions 6 to 9 (IDs do not
match so they can be ignored), then the next record from the master file is read.
After the last record, an End of file marker would be stored. As it has not
been read yet, we cannot be at the end of the file, so the UNTIL statement
tells the algorithm to go back to the REPEAT instruction. It starts again at
the ‘IF IDs do not match’ in instruction 5. The IDs match in this example
because the ID is 2 in the transaction file and the ID in the master file record
is now 2. Instruction 5 is therefore ignored and instruction 6 indicates that
the transaction is carried out and moves on to instruction 7. In this case the
transaction is D so the record has to be deleted, so the old master file record is
not written to the new master file. Instruction 8 is ignored as the transaction is
not C. Instruction 9 causes the next record from the transaction file to be read.
Instruction 10 means the next record from the master file is read. We again meet
the ‘UNTIL end of old master file’ instruction.
As the algorithm has yet to meet the end of file marker, we go back to the REPEAT
which is instruction 3. We are now looking at the second record of the transaction
file (ID 4) and the third record of the old master file (ID 3). If they do not match
(instruction 5), which they do not, the old master file record is written to the new
master file and we jump to instruction 10 and the next record (the fourth, ID 4) of
the old master file is read.
We are not at the end of the file, so the algorithm takes us back to instruction
3 and then on to instruction 4 and then instruction 5: the ‘IF IDs do not match’
instruction. However, this master file record matches with the second record
in the transaction file so the transaction is carried out. It is C, so the master
file record is not copied across, instead the record from the transaction file
(apart from the C) is copied into the new master file. The next record from the
transaction file is read and then the next record from the master file is read.
This carries on until the EOF marker is met in the old master file. The UNTIL
instruction is now true, so the algorithm moves on to instruction 12 so the
current transaction record is written to the master file. Then instruction 13
is followed and the remaining records of the transaction file are added to the
master file, in this case two records.
The steps are carried out as shown, making two assumptions. One is that all
additional records will be at the end of the transaction file. This is usually the
case as new employees would be given the next available ID and the transaction
file would be sorted so these new workers would appear at the end of the file.
The other assumption, which is not really likely in a real situation, is that the
transaction file records will have fields identical to all those in the master file.
To make it easier to follow, we have not included, for example, the rate of pay
field in those records which need to be added, but this field would have to be
included in the master file for every employee.
ID Employee name Department
1 Jose Fernandez Buying
3 Louis Cordoba Sales
4 Nigel Ndlovu Buying
5 Bertrand Couture Buying
6 Lionel Sucio Stores
8 Gurjit Mandare Stores
9 Iqbal Sadiq IT
10 Tyler Lewis Buying
11 John Ward Stores
12 Paolo Miserere IT
EOF End of file marker
Table 1.8 The new master file
Records 2 and 7 have been deleted.
Record 4 has been changed.
Records 11 and 12 have been added.
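The updating algorithm can also be expressed as a short program. The Python sketch below is just one possible interpretation of it: each record is held as a dictionary, both files are assumed to be sorted by ID, and any additions are assumed to come at the end of the transaction file.

def update_master(old_master, transactions):
    # Merge a sorted transaction file (A/C/D updates) into a sorted old master file
    new_master = []
    t = 0
    for record in old_master:
        if t < len(transactions) and transactions[t]["ID"] == record["ID"]:
            if transactions[t]["Transaction"] == "C":
                changed = dict(transactions[t])   # change: copy the transaction data across
                changed.pop("Transaction")
                new_master.append(changed)
            # a D transaction writes nothing, so the record is deleted
            t += 1
        else:
            new_master.append(record)             # no matching transaction: copy unchanged
    for addition in transactions[t:]:             # remaining A transactions are added at the end
        added = dict(addition)
        added.pop("Transaction")
        new_master.append(added)
    return new_master

Running this with the data from Tables 1.6 and 1.7 produces the new master file shown in Table 1.8.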
Activity 1f
Amend the algorithm above so that it would work if there were no records
to be added, that is, if there were no records in the original transaction file
(Table 1.6) after 7, D, Adrienne Pascal, IT.
Use of batch processing in payroll
As mentioned earlier, batch processing is used to calculate wages in a payroll.
Let us look at a typical master file and transaction file which might be used in a
payroll system. We will only consider a very small company but, in real life, it is
not unusual for payroll systems to cater for thousands of employees.
We can assume that the transaction file has been sorted and validated. The
system would have to go through each transaction file record and, using the
hourly rate (Rate) from the matching master file record, calculate that employee's
wages for the week. This would be added to the employee's wages paid so far this
year and replace the Wages_to_date value. We shall assume that the workers pay
no tax and have no other deductions.
1 First record in the transaction file is read
2 First record in the old master file is read
3 REPEAT
4 IDs are compared
5 IF IDs do not match, old master file record is
written to new master file
6 IF IDs match transaction/calculation is carried out
7 Computer calculates the pay, Rate (from master file) multiplied by hours worked (from transaction file)
8 Wages_to_date is updated and record is written to new master file
9 IF IDs match, next record from transaction file
is read
10 Next record from master file is read
11 UNTIL end of transaction file
12 Remaining records of the master file are written to
the new master file
ID Hours_worked
036 40
469 40
578 38
778 40
789 40
EOF End of file
Table 1.9 Transaction file
ID Department Rate ($) Wages_to_date
036 Sales 20 1280
047 Buying 25 1475
165 Buying 25 1525
469 Sales 20 1160
512 Stores 15 825
545 Sales 20 1220
578 IT 30 1860
682 Sales 20 1080
778 IT 30 1920
786 Buying 25 1575
789 IT 30 1830
861 Stores 15 795
EOF End of file
Table 1.10 Master file
The basic outline of the algorithm, to show how the payroll is processed, is
quite similar to the updating algorithm, but the middle part is different:
Notice that because no additional records need to be added to the master file,
we stop the processing when we get to the end of the transaction file.
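For comparison, here is a rough Python version of the payroll run. To keep it short, it looks each transaction up in a dictionary rather than merging the two sorted files record by record as the batch algorithm does, but the calculation it performs is the same.

def run_payroll(master, transactions):
    # Update Wages_to_date in every master file record that has hours in the transaction file
    hours_by_id = {t["ID"]: t["Hours_worked"] for t in transactions}
    new_master = []
    for record in master:
        updated = dict(record)
        if record["ID"] in hours_by_id:
            pay = record["Rate"] * hours_by_id[record["ID"]]
            updated["Wages_to_date"] = record["Wages_to_date"] + pay
        new_master.append(updated)
    return new_master

# For example, employee 036 worked 40 hours at $20 per hour,
# so Wages_to_date rises from 1280 to 2080.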
Activity 1g
Work through this algorithm using the two files shown and write down the new
master file.
Use of batch processing with customer orders
We have already mentioned that customer orders can be dealt with
using batch processing. When a company receives an order from a
customer, it is added to a transaction file. At the end of the day, the
transaction file is used together with a master file to check the items
are in stock and to update the master file with the new number of
items in stock after making deductions arising from the orders. If items
are not in stock, they must be ordered from a supplier. If they are in
stock, then they are added to what is called a picking list which is sent to the
warehouse. The warehouse staff then find the goods that have been ordered and
package them ready for shipment to the customers. The next step is to allocate the
packaged goods to delivery vehicles. At the same time, invoices are produced and
sent to customers for immediate payment or added to their account (if customers
make payments every month). Let us look at how account payments are processed.
The transaction file will contain details of the orders a customer has made in
the last month, together with any payments the customer has made. We will
consider a simple master file which just contains the money owed by each
customer, which is called ‘the balance’.
Again, we can assume that the transaction file is sorted in Cust_no order and
has been validated. The algorithm describing the process would be similar to
the payroll algorithm:
1 First record in the transaction file is read
2 First record in the old master file is read
3 REPEAT
4 Cust_nos are compared
5 IF Cust_nos do not match, old master file record is written to new master file
6 IF Cust_nos match transaction/calculation is carried out
7 Computer calculates New_orders minus Payment_made and subtracts from Balance
8 Balance is updated and record is written to new master file
9 IF Cust_nos match, next record from transaction file is read
10 Next record from master file is read
11 UNTIL end of transaction file
12 Remaining records of the master file are written to
the new master file
Cust_no New_orders Payment_made
219 320 200
451 870 1500
523 190 340
834 520 250
Table 1.11 Transaction file
Cust_no Balance
138 0
187 0
219 0
451 -800
487 -260
523 -340
764 0
802 -920
834 0
869 -540
Table 1.12 Master file
Activity 1h
Work through this algorithm using the two files shown and write down the new
master file.
1.5.2 Online processing
We have seen how batch processing involves gathering data together ready for
processing at a later date. This has an obvious drawback in that the processing
is delayed. In some cases, this is not a problem, for example with applications
such as payroll, which only need to be processed weekly or monthly, or utility
billing systems which are often processed every three months. Some processing,
however, has to be done almost immediately, such as at supermarket checkouts
or interrogating a database for an employee’s details. The original definition
of online processing was that the user was in direct communication with a
central computer. This has now evolved to include any aspect of IT which takes
place over the internet. In this section, we shall be looking at how applications
such as
electronic funds transfer (EFT) and automatic stock control, among
others, take place. One of the differences between batch processing and online
processing, is that in batch processing data is searched using sequential access,
whereas direct access tends to be used in online processing. Direct access is
simply the ability to go straight to the record required without having to read all
the previous records.
Uses of online processing
When data is input into an online system, processing takes place almost
immediately with just a short delay, so short that the user believes they are in
direct communication with the computer. Each transaction is processed before
the next transaction is dealt with. This means that online processing can be
used in a variety of ways. We shall look at some of these here.
Electronic funds transfer
One definition of electronic funds transfer (EFT) is that it is the electronic
transfer of money from one bank account to another using computer-based
systems, without the direct intervention of bank staff. Examples include the
use of an
automated teller machine (ATM), a direct payment of money to
another person, and direct debits, when a company debits the customer’s
bank account for payment for goods or services. EFTs can be transfers
resulting from credit or debit card transactions at a supermarket, a store
or online. They usually involveone bank’s computer communicating with
another bank’s computer, though not always, such as when the ATM being
used belongs to the customer’s bank.
Most people receive their wages as a result of an EFT. Money from the employer’s
bank account is transferred electronically to the employee's bank account.
EFT has become a common way of paying bills. For example, you may decide
that your house needs redecorating, so you ask a painter to come and paint your
house. When the painter has finished, he or she will require payment. One of
the easiest ways of doing this is to take the painter’s bank details and transfer
the money from your account to the painter’s account.
The following steps describe what happens after you have logged in to your
online bank account, although the process may differ slightly from bank to
bank, and assumes you are paying someone new:
1 Select transfer money
2 Select the account you wish to transfer money from
3 Select new payee
4 Type in sort code, account number and payee name
5 Type in amount to transfer
6 Computer checks available balance
7 If you have sufficient funds, the transaction is
authorised
8 Your bank's computer contacts the payee's bank's computer, which searches for the payee's record
9 Amount is subtracted from your account balance
10 Amount is added to payee's account balance
The most common form of electronic funds transfer is purchasing goods in
a store or supermarket when paying at a checkout. This is called EFTPOS
and stands for Electronic Funds Transfer at Point Of Sale. Checkouts at
supermarkets are called point-of-sale terminals. When a customer goes to a
checkout to pay for their goods, they insert their bank card and the following
steps are followed. (This assumes it is not a contactless transaction, in which
case steps 3 to 8 are omitted. Most countries only allow contactless transactions
if the value of the goods is less than a certain amount.)
1 Card chip is read and checked to make sure it is in
date and it is a valid card number
2 If not, card is rejected and transaction terminated
3 PIN is entered by customer into PIN pad
4 Chip reader determines PIN from the chip
5 The two PINs are compared
6 If they are identical the transaction is authorised
7 If they are not identical, error message appears on
the chip reader and two more attempts are allowed
8 If the two PINs are still not identical, the
transaction is rejected and error message issued
9 Customer's bank is contacted by supermarket's computer
10 Customer's bank retrieves customer's record
11 Customer's bank checks if sufficient funds in account
12 If there are insufficient funds then transaction is
rejected
13 If there are sufficient funds then transaction is
authorised
14 The amount of the bill is deducted from customer's account
15 The amount of the bill is credited to supermarket's account
Automatic stock control
There are many ways in which IT can be used in stock control. In this section we
are going to concentrate on automatic stock control. This involves the use of an
automated system where stock is controlled by a computer with little human input.
We have already met the use of EFTPOS terminals. These also serve another
purpose in stores and supermarkets; they are used for stock control. The
checkout operator swipes the barcode of an item and the computer uses this
to update the stock. The terminal in a supermarket or store consists of a
screen (which can be a touchscreen), a barcode reader to input the barcode
of the product, a number pad to enter the barcode in case the barcode label
is damaged, and electronic scales. Each terminal is connected to a computer
network. The hard disk on the network server stores a file (we shall call it the
product file) containing the records of each product that is sold. Each record
consists of different fields containing data, for example:
» barcode number: the number which identifies each different product; this is
the key field because it is different for each product
» product details: a description, such as tin of beans, packet of teabags and so on
» price of the product
» size: weight or volume of the product
» number in stock: the current total of that product in stock; this changes
every time a product is sold or new stock arrives
» re-order level: the number which the computer will use to see if more of
that product needs re-ordering. If the number in stock falls to this level, the
supermarket or store must re-order
» re-order quantity: when the product needs re-ordering, this is the number of
products which are automatically reordered
» supplier number: the identification number of the supplier which will be used
to look for the details on the supplier file.
There is also a file (the supplier file), which contains details of the supplier of
each product, including their contact details. It is more than likely that these
two files would be stored as separate tables in a
relational database.
The processing involved in automatic stock control is as follows:
1 The product's barcode is input from the barcode
reader
2 The computer searches for this barcode number in
the product file and finds it using direct access
3 The number in stock is reduced by one
4 The computer then compares the number in stock
with the re-order level
5 If the number in stock is not equal to the
re-order level then go back to step 1 and repeat
6 If the number in stock is equal to the re-order
level then the computer creates an automatic order
7 It looks up the re-order quantity of that product
8 It looks up the supplier number of that product
9 It searches the supplier file for the record
corresponding to the supplier number found in the
product file
10 It sends the order automatically to the supplier
using the supplier's contact details
11 Go back to step 1 and repeat
When new goods are delivered, the computer automatically updates the product
file by following steps 1 and 2 and then increasing the number in stock by the
re-order quantity.
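The processing above can be pictured as two small routines. In the Python sketch below, ordinary dictionaries stand in for the product and supplier files, and send_order is a placeholder for whatever mechanism actually transmits the order to the supplier; none of these names come from a real stock-control package.

def process_sale(barcode, product_file, supplier_file, send_order):
    # Reduce the stock of the scanned product and re-order automatically if necessary
    product = product_file[barcode]          # direct access using the key field
    product["number_in_stock"] -= 1
    if product["number_in_stock"] == product["re_order_level"]:
        supplier = supplier_file[product["supplier_number"]]
        send_order(supplier["contact_details"], barcode, product["re_order_quantity"])

def receive_delivery(barcode, product_file):
    # When the new goods arrive, the stock rises by the re-order quantity
    product = product_file[barcode]
    product["number_in_stock"] += product["re_order_quantity"]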
Electronic data exchange
Electronic data exchange is often referred to as Electronic Data Interchange
(EDI). For the purposes of this book we will call it EDI, though the terms are
interchangeable. This is a method of exchanging data and documents without
using paper. The documents can take any form such as orders and invoices, with
the electronic exchange between computers using a standard format. An invoice
is a type of bill sent to a customer containing a list of goods sent or services that
have been provided, including a statement of the sum of money due for these.
Most companies create invoices using a computer system. Many then print a paper
copy of the invoice and post it to the customer. If the customer is a business, they
will often type the details into their computer, meaning that the whole activity
of sending and receiving invoices is actually the transfer of information from the
seller’s computer to the customer’s computer. EDI replaces this activity with an
electronic method. The physical exchange of documents could take between three
and five days. EDI often occurs overnight and can take less than an hour.
The old paper-based method was usually made up of these steps:
» A company decides to buy some goods, creates an order and prints it.
» Company posts the order to the supplier.
» Supplier receives the order and enters it into their computer system.
» Company calls supplier to make sure the order has been received, or supplier
posts a letter to the company to say it has received the order.
An EDI system generally has these steps:
» A company decides to buy some goods, creates an order and does not print it.
» EDI software creates an electronic version of the order and sends it
automatically to the supplier.
» Supplier’s computer system receives the order and updates its system.
» Supplier's computer system automatically sends a message back to the
company, confirming receipt of the order.
EDI systems save companies money by providing an alternative to systems that
require humans to operate them, thereby saving wages that would be paid to
people who would sort and search paper documents. There is no need to pay
operators to manually enter the data. EDI also reduces human error during data
entry, since there would be no need to re-enter data that had been sent originally.
Productivity is improved as more documents are processed in less time.
EDI systems are often used because of the security aspects of the system. As
alternative systems using the internet have grown, EDI has had to innovate and
this has been achieved largely by increased security in the transmission of data.
EDI is also used by some examination boards to allow exam entries to be made
and for issuing results. It is also used by hospitals to send and receive documents
to and from doctors, again due to the increased security of this system.
Business-to-business buying and selling
Business-to-business – often termed B2B – refers to buying and selling between
two businesses, rather than between a business and an individual customer
(B2C). The value of B2B transactions is noticeably higher than that of B2C, as
businesses are more likely to buy higher-priced goods and services and to buy
more of them than individual customers are. A car manufacturer, for example,
will buy thousands of tyres, whereas a customer is only likely to buy four tyres at
most when replacements are needed.
Many companies still use EDI for sending orders and invoices, but there are other
aspects of B2B which require online processing. Businesses can buy and sell using
online marketplaces, but many B2B sellers do not take advantage of these. There
is sometimes little difference between B2C and B2B marketplaces, for example
Amazon has B2C and B2B versions of its site. However, you are advised to read
the syllabus regarding the use of brand names. B2B marketplaces work just like
a B2C marketplace in that they connect many sellers to buyers. Buyers have the
opportunity to compare and buy products from many different sellers all on one
site. However, a B2B marketplace is different to a B2C marketplace in that bulk
orders can be placed, discounts can be received for ordering large quantities and
orders can be edited online. Sellers have benefits too, such as lower costs since they
do not have to spend as much money on marketing or setting up a larger website,
as the marketplace is responsible for that part of marketing, although the business
has less control over the look of its products on the website. Sellers can save the time
that would be spent on setting up the sales aspect of a website. The audience is now
global and to a certain extent, a captive audience. Marketplaces can also be used to
test out new products by putting a few up for sale. If they sell well, the volume can
be increased and if not, they can be easily withdrawn.
B2C online transactions are fairly straightforward, whereas B2B transactions
tend to be more complicated. B2B selling prices can vary a great deal with
discounts needing to be taken into account. The quantity of products being
sold is much greater, resulting in more complicated shipping requirements. In
addition, B2B tends to have more government regulation and complex taxation.
However, the failure of companies to invest in online buying and selling can
lead to being left behind by competitors who sell more and make greater profits.
Although EDI is still used by many companies, there are other methods of
buying and selling. Companies can have their own selling website that can be
used by other companies to buy their goods. E-procurement is the term used
to describe the process of obtaining goods and services through the internet.
E-procurement software can be used by sellers and buyers to link directly to
each other’s computer systems.
Online stores
Internet shopping began in the 1990s. At that time, however, the vast majority of
the world's population was not even aware of it. Today, many people shop online, a
development that has arisen due to the emergence of online stores, from both
traditional high-street big names and new online-only companies. Online stores
have become many people's preferred places to shop, for several reasons.
» Customers are not rushed by store assistants into hurriedly comparing
products and prices.
» Customers can shop at a convenient time for them.
» Customers do not have to spend time and money travelling around different
shops to find the best bargains; it is much faster and cheaper to do it online.
» If a local branch of a chain of stores has closed, customers can still shop with
that chain online.
» Customers can look at a wide range of shops all around the world.
» Items are usually cheaper online because warehouse and staff costs are lower
than those of maintaining high-street stores.
» Stores can deliver goods at a time to suit the customer.
» Supermarkets can remember the customer's previous shopping list and favourite
brands, making it quicker for the customer to compile their next order.
» There is a greater choice of manufacturers. Many high-street shops can only
stock items from a few manufacturers.
A typical online store website opens with a home page which contains different
categories, on tabs across the top or down the side or on a drop-down menu.
Customers are able to click on a category tab, which takes them to a different
web page on the site. They can browse the products within that category to get
to the one they want. After opening the store website and browsing product
categories, the customer then decides what they want to buy. At this point,
some stores may ask for a postcode or zip code to check that they actually
deliver to that area. In a real store or supermarket, the customer would place
products in a shopping trolley or basket; similarly, with an online store, they
place them in a virtual shopping basket. This is usually done by clicking on an
‘add to basket’ or simply ‘add’ icon. As with a real store, items can be removed
from, as well as added to, a customer’s shopping basket and, when the customer
has decided that they have finished shopping and they want to pay, they go to
the checkout.
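The virtual shopping basket is essentially a small data structure to which items can be added, from which they can be removed, and whose total can be calculated at the checkout. Below is a minimal sketch in Python of how such a basket could behave; the item names and prices are invented purely for illustration, and a real store would hold this data on its web server rather than in a simple program.

# A minimal sketch of a virtual shopping basket (illustrative only;
# the item names and prices are invented for the example).
class Basket:
    def __init__(self):
        self.items = {}  # maps item name -> (unit price, quantity)

    def add(self, name, price, quantity=1):
        unit_price, held = self.items.get(name, (price, 0))
        self.items[name] = (unit_price, held + quantity)

    def remove(self, name, quantity=1):
        if name in self.items:
            unit_price, held = self.items[name]
            if held <= quantity:
                del self.items[name]      # nothing left, so drop the line entirely
            else:
                self.items[name] = (unit_price, held - quantity)

    def total(self):
        return sum(price * qty for price, qty in self.items.values())

basket = Basket()
basket.add("kettle", 24.99)
basket.add("batteries", 3.50, quantity=2)
basket.remove("batteries")
print(f"Basket total: {basket.total():.2f}")   # 24.99 + 3.50 = 28.49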
Here is an example of the steps, in the form of a simple algorithm, required at
the checkout, although the actual sequence of steps varies depending on the
online store’s website.
1 IF user is new customer, they must register
2 Enter a username
3 Enter and verify (by double entry) password
4 Enter phone number and email address
5 Enter delivery/shipping address
6 Enter billing address if different to shipping address
[some sites only]
7 Choose speed/cost of delivery [some sites only]
8 Enter type of card and credit/debit card number
9 Enter date of expiry
10 IF user is existing customer
11 Log on by entering username and password
12 Confirm delivery address
13 Choose speed/cost of delivery [some sites only]
14 Select credit card/debit card account to be debited if stored,
otherwise enter type of card and credit/debit card number
15 Enter card security code [some sites may not require this]
16 Confirm order
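The checkout steps above can also be written as a short program. The Python sketch below follows the same overall logic (a new customer registers, an existing customer logs on), although it is slightly simplified in that the payment details are asked for once at the end. The prompts are illustrative stand-ins; a real store would use web forms and its own payment back end.

# A simplified sketch of the checkout algorithm above; the prompts are
# illustrative and a real store would use its own forms and systems.
def checkout(is_new_customer):
    if is_new_customer:                                   # new customer: register
        username = input("Choose a username: ")
        password = input("Choose a password: ")
        if input("Re-enter the password: ") != password:  # double-entry verification
            return "Passwords do not match - please register again"
        contact = input("Enter phone number and email address: ")
        delivery_address = input("Enter delivery/shipping address: ")
        billing_address = input("Billing address (blank if the same): ") or delivery_address
    else:                                                 # existing customer: log on
        username = input("Enter your username: ")
        password = input("Enter your password: ")
        delivery_address = input("Confirm your delivery address: ")
    speed = input("Choose speed/cost of delivery: ")
    card = input("Enter type of card and card number: ")
    expiry = input("Enter date of expiry: ")
    security_code = input("Enter card security code: ")
    return "Order confirmed"

print(checkout(is_new_customer=True))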
An email address is nearly always needed, so the store can notify the customer
when the order has been received. Some sites inform you of the progress of the
order’s delivery.
The delivery address is needed so the store knows where the goods will be sent to.
The billing address is required because the store wants to know where to send
the bill, as this is sometimes different to where the goods will be delivered. As
payments are made electronically, this piece of information is largely a formality,
but it can be used for additional credit checks.
The customer is often able to choose how quickly the goods should be delivered
or choose a delivery time slot, if the store has its own delivery vehicles. Some
stores offer same-day or next-day delivery, although usually the quicker the
delivery, the higher the cost.
Often, the customer has to pay a delivery charge as well as the price of the
goods.
1.5.3 Real-time processing
Real-time processing is an example of online processing in that it requires
the inputs to go directly to the CPU of the computer, but the response time
of the computer must be immediate, with no delay whatsoever. This type of
processing is usually found in systems that use sensors, for example
computer-controlled greenhouses, often referred to as glasshouses, which we will be
looking at in more detail in Chapter 3. Temperature, light and moisture sensors
are all used to monitor physical variables and send these as input data to the
computer so that it can take immediate action. For example, if the temperature
falls below a certain value, the heater is automatically switched on. If a
batch-processing system were used, the temperature might have been below the
required value for a long period of time, damaging or even killing the plants
inside, so it is essential that the response from the computer is immediate.
Real-time processing is continuous, so the process is never ending unless the user
switches the system off.
Uses of real-time processing
Real-time systems are systems where the output affects the input. Consider the
glasshouse example above, where the input to the computer or microprocessor
is the temperature. If the temperature is below the required level, the computer
turns the heater on. How this is achieved will be described in greater detail in
Chapter 3. The temperature will now rise and because this process is continuous
the temperature sensor will feed the new temperature back to the computer.
Here, the output, which is the switching on of the heater, has affected the
input, the temperature. This cycle continues until the desired temperature is
reached. Unlike in other online systems, the output is produced quickly enough
that it affects the system before the next input is received. The output happens
immediately. We will now look at three uses of real-time systems.
Central-heating systems
Figure 1.2 A typical central-heating system, showing the boiler, a touchscreen or keypad with a temperature sensor, and the microprocessor
In the system shown in Figure 1.2, there is what is called a combination or
combi boiler. The boiler contains the pump as well as the means of heating
the water. A central-heating system is a real-time system since it involves the
use of sensors; in this case, temperature sensors are used to continuously
monitor the physical variable, the temperature of the house. As
with any real-time system, it involves the use of a feedback loop. This is called a
‘closed system’, in that the temperature is fed back into the system. The boiler heats
the water, which causes the temperature to rise. This, in turn, eventually causes
the boiler to be switched off by the microprocessor, which then results in a drop
in temperature and the boiler has to come back on again and so the sequence is
repeated. We know it is a real-time system since the output affects the input.
In a microprocessor-controlled central-heating system, users select their
required temperature using a touchscreen or keypad. The microprocessor reads
the data from a temperature sensor on the wall and compares it with the value
the user has requested. If it is lower, then the microprocessor switches the boiler
and the pump on. If it is higher, the microprocessor switches both off. In order
for the microprocessor to process the data from the temperature sensor (which
is an analogue sensor), it uses an analogue-to-digital converter to convert this
data into a digital form that it can understand. As a result of the input, the
microprocessor may or may not send signals to
actuators which open the gas
valves in the boiler and/or switch the pump on.
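The analogue-to-digital conversion mentioned above is, at its simplest, a scaling calculation. The Python sketch below assumes a hypothetical 10-bit converter whose readings from 0 to 1023 correspond to a sensor range of 0–50 °C; real sensors and converters differ, so the numbers are purely illustrative.

# Hypothetical example: a 10-bit ADC reading (0-1023) standing for 0-50 degrees C.
ADC_MAX = 1023        # assumed maximum reading of the converter
TEMP_RANGE = 50.0     # assumed span of the sensor in degrees Celsius

def adc_to_celsius(raw_reading):
    # Convert a raw ADC value into a temperature the microprocessor can compare
    # with the user's pre-set value.
    return (raw_reading / ADC_MAX) * TEMP_RANGE

print(adc_to_celsius(430))   # about 21 degrees Celsius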
The microprocessor is also used to control the times at which the system
switches itself on and off. For example, users can set the system to come on
before they get up in the morning and set it to switch off just before they go out
to work. When the system is off, the microprocessor ignores all readings.
A simple algorithm shows how the system works:
1 User enters required temperature using keypad/
touchscreen
2 Microprocessor stores required temperature as a
pre-set value
3 Microprocessor receives temperature from sensor
4 Microprocessor compares temperature from sensor to
pre-set value
5 If temperature from the sensor is lower than the
pre-set value, the microprocessor sends a signal to
an actuator to open the gas valves
6 If temperature from the sensor is lower than the
pre-set value, the microprocessor sends a signal to an
actuator to switch the pump on
7 If temperature is higher than or equal to the pre-set
value, the microprocessor sends a signal to switch
the pump off and close the valves
8 This sequence is repeated until the system is
switched off
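The same control loop can be sketched in Python. The 'sensor' and 'actuators' below are simulated with simple variables so that the loop can actually run; a real central-heating system would read the wall sensor and drive the gas valves and pump instead.

# A simulated sketch of the central-heating control loop described above.
pre_set_value = 20.0        # step 2: the user's required temperature
room_temperature = 17.0     # simulated reading from the wall sensor
heater_on = False

for _ in range(10):         # in a real system this repeats until switched off (step 8)
    current = room_temperature              # step 3: read the sensor
    if current < pre_set_value:             # steps 5 and 6: too cold
        heater_on = True                    # open the gas valves and switch the pump on
    else:                                   # step 7: warm enough
        heater_on = False                   # close the valves and switch the pump off
    room_temperature += 0.5 if heater_on else -0.3   # simulated feedback on the room
    print(f"temperature {current:.1f} C, heater {'on' if heater_on else 'off'}")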
Air-conditioning systems
Air-conditioning systems are more sophisticated than central-heating systems
and involve units such as valves, compressors, condensing units and evaporating
units, which together form a system that, basically, feeds cold air into a room
via a unit containing enclosed fans to circulate it around the room. We shall just
concern ourselves with what happens in one room in a house. Each room has a
temperature sensor and the system uses this to determine whether it needs to
switch the fans on or, in the case of more complex systems, change the speed
of the fans (we will only be considering the simpler version). The user, as with
a central-heating system, enters the required temperature using a keypad or
touchscreen.
Again, we can use a basic algorithm to describe the process:
1 User enters required temperature using keypad/
touchscreen
2 Microprocessor stores required temperature as a
pre-set value
3 Microprocessor receives temperature from sensor
4 Microprocessor compares temperature from sensor to
pre-set value
5 If temperature from the sensor is higher than the
pre-set value, the microprocessor sends a signal to an
actuator to switch the fans on
6 If temperature is lower than or equal to the
pre-set value the microprocessor sends a signal to
an actuator to switch the fans off
7 This sequence is repeated until the system is
switched off
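Only the direction of the comparison changes from the heating example: the fans come on when the sensed temperature is above the pre-set value. A minimal sketch, using the same simulated approach as before:

# Air-conditioning variant: the fans switch on when the sensed temperature
# is ABOVE the pre-set value. The sensor is simulated, as before.
pre_set_value = 22.0
room_temperature = 26.0     # simulated sensor reading
fans_on = False

for _ in range(10):                          # repeated until the system is switched off
    if room_temperature > pre_set_value:     # step 5: too warm
        fans_on = True                       # actuator switches the fans on
    else:                                    # step 6: cool enough
        fans_on = False                      # actuator switches the fans off
    room_temperature += -0.5 if fans_on else 0.2   # simulated feedback
    print(f"temperature {room_temperature:.1f} C, fans {'on' if fans_on else 'off'}")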
Guidance system for rockets
A guidance system can be used to control the movement of different types of
vessel or moving object, such as a ship, aircraft, missile, rocket or satellite. It
involves the process of calculating the changes in position and velocity and
other more complex variables, as well as controlling the object’s course and
speed.
A guidance system, like all computer systems, has inputs, processing, and
outputs. The inputs include data from sensors, the course set by the controller
and data from radio and satellite links. The processing involves using all the
input data to decide what actions, if any, are necessary to maintain or achieve
the required course. The outputs are the actions decided upon as a result of the
processing and use devices such as turbines, fuel pumps and rudders, among
others, to change or maintain the course.
Missiles have a high-precision, real-time guidance system built into their nose.
The guidance system includes a radar subsystem, consisting of one radar that looks
forwards and one that looks downwards. There is also a navigation system and a
‘divert-and-attitude-control’ system, which is able to increase or decrease the
amount of thrust in different engines driving the missile and thereby control
the direction and speed of the missile. Basically, the radar scans the surrounding
terrain and feeds the information into the navigation system, which then uses
the data from the radar together with its stored maps of the area to calculate the
required flight path, avoiding obstacles. The new flight path is immediately fed
into the divert-and-attitude-control system so that it can adjust the direction
and speed of the missile to achieve this flight path. The flight path is constantly
adjusted to ensure the missile does not drift off its projected flight path. A
missile guidance system is an example of a true real-time system, because there
must be no delay between the navigation system realising that the flight path
has been deviated from and the flight path being adjusted.
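Although a real guidance system is far more complex, the feedback idea can be illustrated in a few lines of Python. In the sketch below all of the values and the correction factor are invented: the loop repeatedly compares the measured heading with the required heading and issues a correction, so the output affects the next input.

# A highly simplified sketch of a guidance feedback loop: measure, compare,
# correct, repeat. The headings and the correction factor are invented.
required_heading = 90.0      # degrees: the planned flight path
actual_heading = 97.0        # degrees: as reported by the sensors

for step in range(8):
    error = required_heading - actual_heading   # processing: how far off course?
    correction = 0.5 * error                    # output: thrust/steering adjustment
    actual_heading += correction                # the output affects the next input
    print(f"step {step}: heading {actual_heading:.2f}, error was {error:.2f}")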
1.5.4 Advantages and disadvantages of different methods
of processing
It is important to give both sides of the argument and to take into account the
specific context and the user’s needs when comparing methods.
With real-time processing there is no significant delay in response,
whereas with batch processing the processing can take place well after the initial
inputs have been entered. In real-time processing, information is always up to
date, so the computer or microprocessor is able to take immediate action. With
batch processing, the information is only up to date after the master file has
been updated by the transaction file. Because real-time processing occupies the
CPU constantly it can be very expensive, unlike batch processing which only
uses the CPU at less busy times. Real-time processing needs expensive, complex
computer systems whereas lower specification computers can suffice for batch-
processing systems. Data is collected instantaneously in real-time systems, whereas
gathering the data for a batch-processing system can take a long time and occupy
more people, leading to greater expense in wages.
Comparing real-time systems with online systems, it is easier to maintain and
upgrade online processing systems: computers are less busy overnight, so shutting
down banking or shopping systems for maintenance then is less of a problem, unlike
real-time systems, which have no quiet periods.
Comparing online systems to batch-processing systems, the extra hardware
requirements, input devices, workstations and so on of a large online processing
system can make it more expensive than batch processing. In a company
using an online processing system, each transaction requires the entry of
information immediately. This means that salespeople and other employees
must be connected to the system at all times, unlike batch processing which
only requires a limited number of employees to enter all the data at once. This
leads to savings in wages. An advantage of batch processing compared to online
processing is that it is less expensive than online input as it uses very little
computer processing time to prepare a batch of data or transaction file, and
this can be done at a time convenient to the employees responsible for entering
the data. In online processing, errors are revealed, and can be acted upon,
immediately; in a batch-processing system, an error is only revealed when the
processing takes place. This can be overnight and, as there may be no human
involvement at this stage, management may not be aware of the error until some
time after it actually occurred. Batch processing can be
carried out overnight when the computer would not normally be used. This
means that a company can get more work out of its computer hardware.
Interrogative databases are not well suited to batch processing, as details may
be required at any time. Consider the case where a company uses batch processing for
employee details, payroll and so on. An employee may have an accident and so
a manager needs to contact the employee’s spouse immediately. It would be
pointless running this query overnight. The system must be available whenever
necessary so that the business can function properly.
Most modern business systems use a mixture of batch and online processing,
in order to overcome some of the disadvantages described and benefit from the
advantages.
Examination-style questions
1 A collection of data could be this: johan, Σ, $, ,, AND
  Explain why these are regarded as just items of data. In your explanation give a possible context for each item of data and describe how the items would then become information. [5]
2 A company uses computers to process its payroll, which involves updating a master file.
  a State what processes must happen before the updating can begin. [2]
  b Describe how a master file is updated using a transaction file in a payroll system. You may assume that the only transaction being carried out is the calculation of the weekly pay before tax and other deductions. [6]
3 a Name and describe the purpose of three validation checks other than a presence check. [3]
  b Explain why a presence check is not necessary for all fields. [3]
4 A space agency controls rockets to be sent to the moon.
  Describe how real-time processing would be used by the agency. [5]
5 Describe three different methods used to carry out verification. [3]
6 L12345 is an example of a student identification code.
  Describe two appropriate validation checks which could be applied to this data. [2]
7 Describe three drawbacks of gathering data from direct data sources. [3]