1 Data processing and information

1.1 Data and information
Before we consider data processing, we need to define the term data. To be
completely accurate, the word ‘data’ is the plural of ‘datum’, a single piece of
data. Often, however, we use data in both the singular and the plural senses.
It seems awkward to say ‘the data are incorrect’ so we tend to say ‘the data is
incorrect’. When we use the word ‘data’, it can mean many different things. A lot
of people frequently confuse the terms ‘data’ and ‘information’. For the purposes
of this course we will consider data to be what is usually known as ‘raw’ data.
Data can take several forms; it can be characters, symbols, images, audio clips,
video clips and so on, none of which, on their own, have any meaning.
It is important for you to learn what the term
information means when we use
it in information technology. Information is data that has been given meaning,
which often results from the processing of data, sometimes by a computer. The
processed data can then be given a context and have meaning.
The difference between data and information is that data has no meaning,
whereas information is data which has been given meaning.
Here are some examples:
Sets of data:
110053, 641609, 160012, 390072, 382397, 141186
01432 01223 01955 01384 01253 01284 01905 01227 01832 01902 01981
01926 01597
σ ω ρ F m a
In this chapter you will learn:
+ what is meant by the terms ‘data’ and
‘information’ and about their use
+ what is meant by the terms ‘direct’ and
‘indirect’ data and about their uses and sources
+ what is meant by quality of information
+ what is meant by encryption, why it is
needed, and about the methods and uses of
encryption and protocols
+ about the methods and uses of validation and
verification
+ about the methods and uses of different
methods of processing (batch, online, real-time)
+ how to write a simple algorithm.
Before starting this chapter you should:
+ be familiar with the terms ‘observation’, ‘interviews’, ‘questionnaires’,
‘central processing unit (CPU)’, ‘chip and PIN’, ‘direct access’,
‘encryption’, ‘file’, ‘key field’, ‘RFID’, ‘sort’, ‘validation’ and ‘verification’.
These are sets of data which do not have a meaning until they are put into
context.
If we are told that 110053, 641609, 160012, 390072, 382397, and 141186 are
all postal codes in India (a context), the first set of data becomes information as
it now has meaning.
Similarly, if you are informed that 01432, 01223, 01955, 01384, 01253,
01284, 01905, 01227, 01832, 01902, 01981, 01926, and 01597 are telephone
area dialling codes in the UK, they can now be read in context and we can
understand them, as they now have a meaning.
The final set of data seems to be letters of the Greek alphabet apart from F, m
and a, which are letters in the Latin alphabet. However, if we are told that the
context is mathematical, scientific or engineering
formulae, we can see that they
all represent a different variable: σ represents standard deviation, ω represents
angular velocity, ρ is density, F is force, m is mass and a is acceleration. They
each now have meaning.
1.1.1 Data processing
On a computer, data is stored as a sequence of binary digits (bits) in the form of
ones and zeros. We will discuss bits later in this chapter, when we look at parity
checks. We can store data on fixed or removable media such as hard-disk
drives, solid-state drives, DVDs, SD cards and memory sticks, or in RAM. Data is
usually processed for a particular purpose, often so that it can be analysed. The
computer processing involved uses different operations to produce new data
from the source data. You will, perhaps, have met this in previous practical work
you have done, where you may have been given source files, including
.csv files.
You may have been asked to open these in a spreadsheet and add formulae. This
is the processing of that data so that it then has meaning.
To sum up, data is input, stored, and processed by a computer, for output
as usable information. Later in this chapter we will look at different types of
processing.
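To make this concrete, here is a minimal sketch in Python of turning raw data into information; the file name sales.csv and its ‘amount’ column are purely illustrative and are not files supplied with this course.

import csv

# Each row of the file is raw data; adding the values together and labelling
# the result gives the data a context and a meaning.
total = 0
with open("sales.csv", newline="") as source:
    for row in csv.DictReader(source):
        total += float(row["amount"])

print("Total sales:", total)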
Activity 1a
Explain the difference between data and information.
1.1.2 Direct and indirect data
Direct data is data that is collected for a specific purpose or task and is used
for that purpose and that purpose only. It is often referred to as ‘original
source data’. Examples of sources of direct data are questionnaires, interviews,
observation, and
data logging.
Indirect data is data that is obtained from a third party and used for a different
purpose to that which it was originally collected for and which is not necessarily
related to the current task. Examples of sources of indirect data are the electoral
register and businesses collecting personal information for use by other
organisations (third parties).
Direct data sources
Direct data sources are sources that provide the data gatherer with original
data. We will consider four such sources.
Questionnaires
A questionnaire consists of a set of questions, usually on a specific subject
or issue. The questions are designed to gather data from those people being
questioned. A questionnaire can be a mixture of what are called closed
questions (where you have to choose one or more answers from those
provided) and open-ended questions (questions where you can write in your
answers in more detail). Questionnaires are easy to distribute, complete and
collect as most people are familiar with the process. They can be completed on
paper or on computer.
Interviews
An interview is a formal meeting, usually between two people, where one of
them, the interviewer, asks questions, and the other person, the interviewee,
answers those questions. Interviews are used to collect data about a topic
and can be structured or unstructured. Structured interviews are similar to
a questionnaire, whereby the same questions are asked in the same order for
each interviewee and with a choice of answers. Unstructured interviews can be
different for each interviewee, particularly as they give them the opportunity
to expand on their answers. There is usually no pre-set list of answers in an
unstructured interview.
Observation
Observation is a method of data collection in which the data collectors
watch what happens in a given situation. The observer collects data by seeing
for themselves what happens, rather than depending on the answers from
interviewees or the accuracy of completed questionnaires.
Data logging
Data logging means using a computer and sensors to collect data. The
data is then analysed, saved and the results are output, often in the form of
graphs and charts. Data logging systems can gather and display data as an
event happens. The data is usually collected over a period of time, either
continuously or at regular intervals, in order to observe particular trends. It
involves recording data from one or more sensors and the analysis usually
requires special software. Data logging is commonly used in scientific
experiments, in
monitoring systems where there is the need to collect
information faster than a human possibly could, in hazardous circumstances
such as volcanoes and nuclear reactors, and in cases where accuracy is
essential. Examples of the types of information a data logging system can
collect include temperatures, sound frequencies, light intensities, electrical
currents, and pressure.
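As an illustration only, the short Python sketch below logs a reading at regular intervals; read_temperature() is a hypothetical stand-in for whatever sensor interface a real data logger would use.

import time

def read_temperature():
    # Placeholder: a real data logger would query the sensor hardware here.
    return 21.5

readings = []
for _ in range(60):                      # take 60 readings...
    readings.append((time.time(), read_temperature()))
    time.sleep(60)                       # ...one every minute

# The collected readings could then be analysed or plotted as a graph.
print(len(readings), "readings logged")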
Uses of direct data
An example of a use of direct data could be planning the alteration of a bus
route. A committee of residents on a new housing development, just outside
a local village, wants a bus company to re-route the bus service from the local
village to the town centre so that residents on the new development are able
to get to the town centre more easily. It will, however, involve the bus route
running through open countryside near the village. In order to persuade the
bus company to change the bus route, the committee will need to collect some
original data to present to them.
This data will include:
» How long it takes to walk from the new development to the existing bus
routes.
» The number of passengers who use the existing route.
» The number of passengers who would use the new route.
» The effect the villagers think the changed route would have on their daily
lives.
Here are some examples of how data could be collected.
» How long it takes to walk from the new development to the existing bus
routes.
Original data could be collected by actually walking from various points in
the new development and timing how long it would take. This might not be
practical if several points on the new development have to be considered, given
the time it would take to measure all the possible walking times.
» The number of passengers who use the existing route.
The suggested method to be used is a data-logger. Sensors fitted around the
door of each bus could be used to count the numbers of passengers boarding
and getting off at each stop. From these counts, a running total of the number of
passengers on the bus at any point along its route can be calculated (a short
sketch of this calculation is given after this list). The data would be fed
back to a data-logger or a tablet computer.
» The number of passengers who would use the new route.
In order to save time, questionnaires could be used. People living on the new
development would be asked to complete the questionnaires. The completed
questionnaires could then be transferred to a computer. Provided that the
questionnaires were completed honestly, an accurate assessment of how many
passengers would use the new route could be obtained.
» The effect the villagers think the changed route would have on their daily
lives.
In order to ensure completely honest responses, face-to-face interviews would be
best. The disadvantages of interviews are the length of time the process would
take and the potential difficulties of transferring the responses into a format that
a computer could deal with. However, because the interviewer can add follow-
up questions, the answers would be more accurate.
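As promised above, here is a minimal Python sketch of how the passenger numbers from the door sensors could be turned into a running total; the boarding and alighting figures are invented purely for illustration.

boarded  = [12, 5, 3, 0, 7]   # passengers getting on at each stop (example data)
alighted = [0, 2, 6, 4, 9]    # passengers getting off at each stop (example data)

on_board = 0
for stop, (on, off) in enumerate(zip(boarded, alighted), start=1):
    on_board += on - off      # running total of passengers currently on the bus
    print("After stop", stop, ":", on_board, "passengers on board")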
Indirect data sources
Indirect data sources are third-party sources that the data gatherer can obtain
data from. We will consider two such sources.
Electoral register
The
electoral register, also referred to as the electoral roll, is an example
of an indirect data source. It is a list of adults who are entitled to vote in an
election. Some countries have an ‘open’ version of the register which can be
purchased and used for any purpose. Electoral registers are used in countries
such as the USA, the UK, Australia, New Zealand, and many others. They
contain information such as name, address, age and other personal details,
although individuals are often allowed to remove some details from the open
version. In many countries,the full register can only be used for limited
purposes specified by the law in that country. The personal data in the
register must always be processed in a way that complies with the country’s
data protection laws.
Businesses collecting data from third parties
Businesses collect a great deal of personal information from
third parties,
such as their customers, when they sell them a product. Whenever they buy
something online, customers have to enter personal information or they have
already done this on a previous visit to that business’s website. It is often the
case that they agree to the business sharing this with other organisations.
Another development, in this regard, has been the emergence of data brokers.
These are companies that collect and analyse some of an individual’s most
sensitive personal information and sell it to each other and to advertisers and
other organisations without the individual’s knowledge. They usually gather it
by following people’s internet, social media, and mobile phone activity.
Uses of indirect data
Apart from elections and other government purposes, the electoral register
can only be used to select individuals for jury service or by credit reference
agencies. These agencies are allowed to buy the full register to help them check
the names and addresses of people applying for credit. They are also allowed to
use the register to carry out identity checks in an attempt to deal with money
laundering. It is a criminal offence for anyone to supply or use the register for
anything else. The open register, however, can have various uses; businesses can
use it to perform identity checks on their customers, charities often use it for
fundraising purposes, debt-collection agencies use it to track down people who
owe money. Whenever the address of an individual is required, a business could
use the open register to check it.
Businesses which collect personal information often use it to create mailing lists
that they then sell to other organisations, which are then able to send emails or
even brochures through the post.
These examples are not the only type of indirect data source. Any
organisation that provides data or information to the general public for
use by them can be said to be an indirect source. In the bus route example
described above, an indirect source could be used to provide some of the
required information. For instance, the timetable of the current bus service
could be used by the committee to work out the number of passengers using
the route by seeing how many times the bus runs during the day. However,
this would not be very accurate, as the buses may not carry a full load of
passengers each time and this is clearly not the purpose for which the data
was intended.
Another scenario could be studying pollution in rivers. Direct data sources
could be used, of course; questionnaires could be handed out to local
landowners and residents in houses near to the river, asking about the effects
on them of the pollution, and they could also be interviewed. Computers
with sensors could be used to collect data from the river. However, indirect
data sources could also be used; documents may have been published by
government departments showing pollution data for the area and there may be
environmental campaigners who have also published data related to pollution
in the area.
Advantages and disadvantages of direct and indirect data
Table 1.1 summarises the advantages and disadvantages of direct data when
compared to indirect data. Notice how each point contains a comparison.

Advantages of direct data
» We know how reliable direct data is, since we know where it originated. Where data is required from a whole group of people, we can ensure that a representative cross-section of that group is sampled. With indirect data sources we may not know where the data originated, and it could be that the source is only a small section of that group rather than a cross-section of the whole group; this is often referred to as sampling bias.
» The person collecting the data can use methods to gather specific data even if the required data is obscure, whereas with indirect data sources this type of data may never have been collected before.
» The data collector or gatherer only needs to collect as much or as little data as necessary, whereas with indirect data sources the original purpose for which the data was collected may be quite different to the purpose for which it is needed now, so irrelevant data may need to be removed.
» Once the data has been collected it may be useful to other organisations, and there may be opportunities to sell the data to them, reducing the expense of collection. With indirect data this opportunity will probably not arise, as organisations can go direct to the source themselves.

Disadvantages of direct data
» Because of time and cash constraints, the sample or group size may be small, whereas indirect data sources tend to provide larger sets of data that would use up less time and money than direct data collection with a larger sample size.
» The person collecting the data may not be able to gain physical access to particular groups of people (perhaps for geographical reasons), whereas the use of indirect data sources allows data from such groups to be gathered.
» Using a direct data source can be problematic if the people being interviewed are not available, reducing the sample size, whereas using indirect data sources allows the sample size to be greater, resulting in increased confidence in the results produced.
» It may not be possible to gather original data because of the time of year; for example, summer rainfall data may be needed but at the time of the data-gathering it is winter. With indirect data, historical weather data is available irrespective of the time of year.
» Gathering data from a specific sample takes a lot longer than it would with indirect data. In addition, by the time all the required data has been collected it may be out of date, so an indirect data source could have been used instead.
» Indirect data may be of a higher quality, as it might already have been collated and grouped into meaningful categories, whereas with direct data sources questionnaire answers can sometimes be difficult to read and transcripts of interviews take time to read in order to create the data source.
» The collection of direct data may be more expensive, as people may have to be paid to collect it and special equipment may have to be bought, such as data-loggers, computers with sensors or paper for questionnaires, none of which is needed when using an indirect source. There are still costs involved in using indirect data sources, such as the travelling expenses and time taken to go to the source, but these are generally lower.

Table 1.1 Advantages and disadvantages of direct and indirect data
Activity 1b
1 Explain why observation is considered to be a direct data source.
2 Give two differences between indirect data and direct data.
1.2 Quality of information
Measuring the quality of information is sometimes based on the value which
the user places on the information collected. As such it could be argued that
the judgement regarding the quality of information is fairly subjective, that is,
it depends on the user and such judgements can vary between users. However,
many experts do suggest that these judgements can be objective if based on
factors which are believed to affect the quality of information.
Poor quality data can lead to serious consequences. Poor data may give a
distorted view of business dealings, which can then lead to a business making
poor decisions. Customers can be put off dealing with businesses that give poor
service due to inaccurate data, causing the business to get a poor reputation.
With poor quality data it can be difficult for companies to have accurate
knowledge of their current performance and sales trends, which makes it hard
for them to identify worthwhile future opportunities.
One example can be seen in the data provided by a hospital in the UK, which
resulted in it being temporarily closed down, until it was realised that the death
rate data provided had been incorrect and it was actually significantly lower.
Meanwhile, in the USA, incorrectly addressed mail costs the postal service a
substantial amount of money and time to process correctly.
Some of the factors that affect the quality of information are described here.
1.2.1 Factors that affect the quality of information
Accuracy
As far as possible, information should be free from errors and mistakes. The
accuracy of information often depends on the accuracy of the collected
data before it is processed. If the original data is inaccurate then the resulting
information will also be inaccurate. In order to make sure the information is
accurate, a lot of time needs to be given to the collection and checking of the
data. Mistakes can easily occur. Consider a simple stock check. If it is carried
out manually, a quantity of 62 could easily be copied down as 26 if the digits
were accidentally transposed. This information is now inaccurate. More careful
checking of the data might have prevented this.
We will look at methods of checking the accuracy of data, such as
verification and
validation, later in this chapter. It is easy to see how errors might occur during the
data collection process. When using a direct data source, if we have not made the
questions clear then the people answering the questionnaires or being interviewed
may not understand them. We need to make sure that questions are clearly phrased
and are unambiguous, otherwise they might lead interviewees into providing the
answers that they think the interviewer is expecting. This can lead to the same
response being given by everyone, even though the question is open-ended. If
the questions are too open-ended, it could be difficult to quantify the responses.
It is often a good idea to include multiple-choice questions where the respondent
chooses an answer from those provided. These can be quantified quite easily. It is
important, however, to include a sufficient number of alternative answers.
Other reasons why the information derived from a study might be inaccurate
are that the sample chosen is not representative of the whole group or that the
data collector makes some errors when collecting or when entering the data into
a computer. If sensors are being used, these must be calibrated before use and
must be properly connected to the computer. In addition, the computer system
needs to be set up correctly so that the readings are interpreted correctly.
Relevance
When judging the quality of information, we need to consider the data that is
being collected. Relevance is an important factor because there has to be a good
reason why that particular set of data is being collected. Data captured should be
relevant to the purposes for which it is to be used. It must meet the requirements
of the user. The question that needs to be asked is: is the data really needed, or is it
being collected just for the sake of it? The relevance of data matters because the
collection of irrelevant data will entail a waste of time as well as money.
There are a number of ways in which the data may or may not be relevant to the
user’s needs. It could be too detailed or concentrate too much on one aspect.
On the other hand, it might be too general, covering more aspects of the task
than is necessary. It may relate to geographical areas that are not really part of
the study. Where the study is meant to be about pollution in a local area, for
example, data from other parts of the country would not be relevant. When
looking for relevant information, it is important to be clear about what the
information needs are for each specific search.
It is also necessary to be clear about the search strategy: what the user wants
and does not want to find and therefore what the user needs to look for. In
an academic study, it is important to select academic sources. Business sources
or sources which appear to have a vested interest should be ignored. Having
selected the sources, it is important to select the relevant information within
them. Consider a school situation. You need to study a tremendous amount of
information to prepare for your exams. How would you feel if your teachers
chose to spend several lessons talking about aspects of the subject that they
found really interesting? You may find that it was very interesting, but it
probably would not be very relevant to what you need to pass your course.
Age
How old information is can affect its quality. As well as being accurate and
relevant, information needs to be up-to-date. Most information tends to change
over time and inaccurate results can arise from information which has not been
updated regularly. This could apply, for example, to personal information in a
database being left unchanged. Someone could get married and have a baby.
If the original information was used, which had the person as single with no
dependants, this would produce inaccurate results if the person was applying for
a loan. This is because people who are married with children tend to be viewed
as being more responsible and more likely to keep up with repayments. This
inaccurate information would also affect a retailer’s targeted advertising if it
wanted to sell baby products to such customers, as the person would not appear
on its list of targets. The age of information is important, because information
that is not up-to-date can lead to people making the wrong decisions. In turn,
that costs organisations time, money, and therefore, profits.
Level of detail
For information to be useful, it needs to have the right amount of detail.
Sometimes, it is possible for the information to have too much detail, making it
difficult to extract the information you really want. Information should be in a
form that is short enough to allow for its examination and use. There should be
no extraneous information. For example, it is usual to summarise statistical data
and produce this information either in the form of a
table or using a chart. Most
people would consider a chart to be more concise than data in tables, as there is
little or no unnecessary information in a chart. A balance has to be struck between
the level of detail and conciseness. Suppose a car company director wants to see
a summary of the sales figures of all car models for the last year; the information
with the correct level of detail would be a graph showing the overall figures for
each month. If the director was given figures showing the sales of each model
every day of the previous 12 months in the form of a large report, this would be
seen as the wrong level of detail because it is not a summary. It is important to
understand what the user needs when they ask you for specific information.
On the other hand, the information might not have enough detail, meaning
that you do not get an overall view of the problem. This links closely to the
issue of the completeness of the information, which we will look at next.
Completeness of the information
In order for information to be of high quality it needs to be complete. To be
complete, information must deal with all the relevant parts of a problem. If it is
incomplete, there will be gaps in the information and it will be very difficult to
use to solve a particular problem or choose a certain course of action. Discovering
and collecting the extra information in order to remove these gaps may result in
improving the quality of the information, but can prove to be time-consuming.
Therefore, if the information is not complete, a decision has to be made: either
that it is complete enough to make a decision about a problem or that additional
data needs to be collected to complete the information. Consider the car company
director mentioned above who wants to see a summary of the sales figures for
the last year. If the director was given figures showing the sales for the first six
months, this would be incomplete. If the director was shown the figures for only
the best-selling models, this would be incomplete. It is important to understand
what the user needs when they ask you for specific information. To sum up,
completeness is as necessary as accuracy when inputting data into a database.
Activity 1c
1 List two factors that affect the quality of information.
2 Briefly describe what is meant by the quality of information.
1.3 Encryption
1.3.1 The need for encryption
Whenever you send personal information across the internet, whether it is credit
card information or personal details, there is a risk that it can be intercepted.
Once it is intercepted, the information can be changed, used for purposes such as
identity theft or cyber-fraud, or held to ransom. If it is information regarding a
company’s secrets, it could be sold by
hackers to rival companies. If, however,
the information is intercepted but it is unreadable or cannot be understood, it
becomes useless to the hacker or interceptor. Too many companies or individuals
become victims of hackers taking advantage of readily available usernames
and
passwords. No matter how vigilant we are regarding the security of our
computer systems, hackers will always find a way of getting into them, but if
they cannot decipher the information, it will mean the act of hacking is not
worthwhile. This is where encryption comes in. Encryption keeps much of our
personal data private and secure, often without us realising it. It prevents hackers
from reading and understanding our personal communications and protects
us when we bank and shop. Data is scrambled or jumbled up so that it is
completely unreadable. This prevents hackers understanding the data, as all they
see is a random selection of letters, numbers and symbols.
Encryption is a way of scrambling data so that only authorised people can
understand the information. It is the process of converting information into a
code which is impossible to understand. This process is used whether the data is
being transmitted across the internet or is just being stored. It does not prevent
cyber criminals intercepting sensitive information, but it does prevent them
from understanding it. Technically, it is the process of converting
plaintext to
ciphertext.
It is not just personal computers that are affected; businesses and commercial
organisations are also liable to be affected by hacking activities. Employing data
encryption is a safe way for companies to protect their confidential information
and their reputation with their clients, since the benefits of encryption do not
just apply to the use of the internet.
Information should also be encrypted on computers, hard-disk drives, pen
drives and portable devices, whether they be laptops, tablets, or smartphones.
The misuse of the data on these devices will be prevented, should the device be
hacked, lost orstolen.
1.3.2 Methods of encryption
Encryption is the name given to converting data into a code by scrambling it,
with the resulting symbols appearing to be all jumbled up. The algorithms (we
will be looking at the topic of algorithms in much more detail in Chapter 4)
which are used to convert the data are so complex that even the most dedicated
hacker with plenty of time to spare and hacking software to help them would be
extremely unlikely to discover the meaning of the data. Encrypted data is often
called ciphertext, whereas data before it is encrypted is called plaintext.
The way that encryption works is that the computer sending the message
uses an
encryption key to encode the data. The receiving computer has a
corresponding decryption key that can translate it back again. The process of
decryption here is basically reversing the encryption. A key is just a collection of
bits, often randomly generated by a computer. The greater the length of the key,
the more effective the encryption. Many systems use 128-bit keys, which gives
2¹²⁸ different combinations. It has been estimated that it would take a really
powerful computer 10¹⁸ (1 000 000 000 000 000 000, or one quintillion) years
to go through all the different combinations. Modern encryption uses 256-bit
keys, which would take very much longer to crack. As you can imagine, this
makes this form of encryption virtually impossible to crack. The key is used in
conjunction with an algorithm to create the ciphertext.
Figure 1.1 Encryption: the encryption algorithm combines the plaintext (for example ‘This is a test message which is to be encrypted.’) with the encryption key to produce unreadable ciphertext; the decryption algorithm uses the decryption key to turn the ciphertext back into the original plaintext.
There are two main types of encryption. One is called symmetric encryption
and the other is asymmetric encryption, which is also referred to as public-key
encryption.
Symmetric encryption
Symmetric encryption, often referred to as ‘secret key encryption’, involves the
sending computer, or user, and the receiving computer, or user, having the same
key to encrypt and decrypt a message. Although symmetric encryption is a much
faster process than asymmetric encryption, there is the problem of the originator
of the message making sure the person receiving the message has the same
private
key. The originator has to send the encryption key to the recipient before they
can decrypt the message. This, however, leads to security problems, since this key
couldbe intercepted by anybody and used to decrypt the message. Many companies
overcome this problem by using asymmetric encryption to send the secret key but
use symmetric encryption to encrypt data. So, with symmetric encryption both
sender and recipient have the same secret, private, encryption key which scrambles
the original data and unscrambles it back to an understandable format.
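A minimal sketch of symmetric encryption, using the third-party Python package cryptography (one possible library, shown purely as an illustration): the same key both encrypts and decrypts.

from cryptography.fernet import Fernet

key = Fernet.generate_key()        # the single shared secret key
cipher = Fernet(key)

ciphertext = cipher.encrypt(b"This is a test message which is to be encrypted.")
print(ciphertext)                  # scrambled, unreadable bytes
print(cipher.decrypt(ciphertext))  # readable again, but only with the same key

The difficulty described above is getting that one key to the recipient without anyone else intercepting it.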
Asymmetric encryption
Asymmetric encryption, sometimes referred to as ‘public-key encryption’, uses
two different keys, one public and one private. A
public key, which is distributed
among many users or computers, is used to encrypt the data. Essentially, this public
key is published for anyone to use to encrypt messages. A private key, which is only
available to the computers, or users, receiving the message, is used to decrypt the
data. When a message is encrypted using the public key, it can be sent across a public
channel such as the internet. This is not a problem as the public key cannot be used
to decrypt a message that it was used to encrypt. It is incredibly complicated, if not
impossible, to guess the private key using the public key and the encrypted message.
Basically, any user who needs to send sensitive data over the internet securely, can do
so by using the public key to encrypt the data, but the data can only be decrypted
by the receiving computer if it has its own private key. Asymmetric encryption is
often used to send emails and to digitally sign documents.
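The sketch below, again using the cryptography package and shown only as an illustration, generates a key pair and encrypts with the public key so that only the holder of the private key can decrypt.

from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()    # this half can be published freely

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

ciphertext = public_key.encrypt(b"Sensitive data", oaep)   # anyone can do this
plaintext = private_key.decrypt(ciphertext, oaep)          # only the key owner can
print(plaintext)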
1.3.3 Encryption protocols
An encryption protocol is the set of rules setting out how the algorithms should be
used to secure information. There are several encryption protocols. IPsec (internet
protocol security)
is one such protocol suite which allows the authentication of
computers and encryption of packets of data in order to provide secure encrypted
communication between two computers over an internet protocol (IP) network.
It is often used in VPNs (virtual private networks). SSH (secure shell) is another
encryption protocol used to enable remote logging on to a computer network,
securely. SSH is often used to log in and perform operations on remote computers,
but it can also be used for transferring data from one computer to another. The
most popular protocol used when accessing web pages securely is
transport layer
security (TLS). TLS is an improved version of the secure sockets layer (SSL)
protocol and has now, more or less, taken over from it, although the term SSL/TLS
is still sometimes used to bracket the two protocols together.
The purpose of secure sockets layer (SSL)/transport layer security (TLS)
Because TLS is a development of SSL, the terms TLS and SSL are sometimes
used interchangeably. We will use the term SSL/TLS in this book. The three
main purposes of SSL/TLS are to:
» enable encryption in order to protect data
» make sure that the people/companies exchanging data are who they say they
are (authentication)
» ensure the integrity of the data to make sure it has not been corrupted or altered.
Two other purposes are to:
» ensure that a website meets PCI DSS rules. The Payment Card Industry
Data Security Standard (PCI DSS) was set up so that company websites
could process bank card payments securely and to help reduce card fraud.
This is achieved by setting standards for the storage, transmission and
processing of bank card data that businesses deal with. Later versions of TLS
are required to meet new standards which have been imposed.
» improve customer trust. If customers know that a company is using the SSL/
TLS protocol to protect its website, they are more inclined to do business
with that company.
Many websites use SSL/TLS when encrypting data while it is being sent to
and from them. This keeps attackers from accessing that data while it is being
transferred. SSL/TLS should be used when storing or sending sensitive data
over the internet, such as when completing tax returns, buying goods online,
or renewing house and car insurance. Only going to websites which use SSL/
TLS is good practice. The SSL/TLS protocol enables the creation of a secure
connection between a web server and a browser. Data that is being transferred
to the web server is protected from eavesdroppers (the name given to people
who try to intercept internet communications).
The SSL/TLS protocol verifies the identity of the
server. Any website with an
HTTPS address uses SSL/TLS. In order to verify the identity of the server, the
protocol makes use of digital certificates, which contain such information as the
domain name that the certificate is issued for, which organisation, individual or
device it was issued to, the certificate authority (CA) that issued it, the CA’s digital
signature, and the public key, as well as other items. Although SSL was replaced by
TLS many years ago, these certificates are still referred to as SSL certificates today.
As well as keeping the user’s data secure, a website needs a digital certificate in order
to verify ownership of the website and also to prevent fraudsters creating a fake
version of the website. Valid SSL certificates can only be obtained from a CA. CAs
can be private companies or even governments. Before allowing someone to have
an SSL certificate, the CA will carry out a number of checks on an applicant and
following that, it is the responsibility of the CA to make sure that the company or
individual receives a unique certificate. Unfortunately, if hackers are able to break
through a CA’s security, they can start issuing bogus certificates to users and will
then be in a strong position to crack the user’s encryption.
The use of SSL/TLS in client–server communication
Transport layer security (TLS) is used for applications that require data to
be securely exchanged over a client–server network, such as web browsing
sessions and file transfers. Just like IPsec it can enable VPN connections and
Voice over IP(VoIP).
In order to open an SSL/TLS connection, a
client needs to obtain the public
key. For our purposes, we can consider the client to be a web user or a web
browser and the server to be the website. The public key is found in the server’s
digital certificate. From this we can see that the SSL/TLS certificate proves that
a client is communicating with the actual server that owns the domain, thereby
proving the authenticity of the server.
When a browser (client) wants to access a website (server) that is secured by
SSL/TLS, the client and the server must carry out an SSL/TLS handshake. A
handshake, in IT terms, happens when two devices want to start communicating.
One device sends a message to the other device telling it that it wants to set up
a communications channel. The two devices then send several messages to each
other so they can agree on the rules for communicating (a communications
protocol). Handshaking occurs before the transfer of data can take place.
With an SSL/TLS handshake, the client sends a message to the server telling it
what version of SSL/TLS it uses together with a list of the different ciphersuites
(types of encryption) that the client can use. The list of ciphersuites has the client’s
preferred type at the top and its least favourite at the bottom. The server responds
with a message which contains the ciphersuite it has chosen from the client’s list.
The server also shows the client its SSL certificate. The client then carries out a
number of checks to make sure that the certificate was issued by a trusted CA and
that it is in date and that the server is the legitimate owner of the public and private
keys. The client now sends a random string of bits that is used by both the client
and the server to calculate the secret session key. The string itself is encrypted using the
server’s public key. Authentication of the client is optional in the process. The client
sends the server another message, encrypted using the secret key, telling the server
that the client part of the handshake is complete. We will see in more detail in the
section on HTTPS how any further transmitted data is encrypted.
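To see a real handshake take place, the sketch below uses Python's standard ssl module to connect to a web server and report what was agreed; example.com is just a placeholder host.

import socket
import ssl

context = ssl.create_default_context()        # loads the trusted CA certificates

with socket.create_connection(("example.com", 443)) as sock:
    with context.wrap_socket(sock, server_hostname="example.com") as tls:
        print(tls.version())                  # protocol agreed, e.g. TLSv1.3
        print(tls.cipher())                   # ciphersuite chosen by the server
        print(tls.getpeercert()["issuer"])    # the CA that issued the certificate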
1.3.4 Uses of encryption
There are many reasons to encrypt data. Companies often store confidential
data about their employees, which could include medical records, payroll data,
as well as personal data. These need to be encrypted to prevent them becoming
public knowledge. An employee in a shared office may not want others to
have access to their work which may be stored on a hard disk, so it needs to be
encrypted. A company’s head office may wish to share sensitive business plans
with other offices using the internet. If the data is encrypted, they do not have
to worry about what would happen if it were intercepted. Company employees
and individuals may need to take their laptops or other portable devices with
them when they travel for work or pleasure. If the device contains sensitive
information which is not encrypted, it is possible that the information could be
retrieved by a third party if the device is left unattended.
As recently as 2018, it was reported that over the previous four years, staff
in five UK government departments had lost more than 600 laptops, mobile
phones and memory sticks. Fortunately, the data had been encrypted and was
therefore protected from prying eyes. Unfortunately, there have been occasions
where laptops have been left on trains and the data was unencrypted, causing
great embarrassment to the government when this was discovered.
There are other situations where encryption should be used. One example is
when individuals are emailing each other with information they would want
to remain confidential. They need to prevent anybody else from reading and
understanding their mail. People use websites for online shopping and online
banking. When doing so, the debit/credit card and other bank account details
should be encrypted to prevent fraudulent activity taking place.
Let us now consider three specific uses of encryption.
Hard-disk encryption
The principle of hard-disk encryption is fairly straightforward. When a file is
written to the disk, it is automatically encrypted by specialised software. When a
file is read from the disk, the software automatically decrypts it while leaving all
other data on the disk encrypted. The encryption and decryption processes are
understood by the most frequently used application software such as spreadsheets,
databases and word processors. The whole disk is encrypted, including data files,
the OS and any other software on the disk. Full (or whole) disk encryption is your
protection should the disk be stolen, or just left unattended. So, even if the disk is
still in the original computer, or removed and put into another computer, the disk
remains encrypted and only the keyholder can make use of its contents.
Another benefit of full disk encryption is that it automatically encrypts the data
as soon as it is saved to the hard disk. You do not have to do anything, unlike
the encryption of files and folders, where you have to individually encrypt them
as you go.
There are, however, drawbacks to encrypting the whole disk. If an encrypted disk
crashes or the OS becomes corrupted, you can lose all your data permanently or,
at the very least, disk data recovery becomes problematic. It is also important to
store encryption keys in a safe place, because as soon as a disk is fully encrypted,
no one can make use of any of the data or software without the key. Another
drawback can be that booting up the computer can be a slower process.
Email encryption
When sending emails, it is good practice to encrypt messages so that their
content cannot be read by anyone other than the person they are being sent
to. Many people might think that having a password to login to their email
account
is sufficient protection. Unfortunately, emails tend to be susceptible to
interception and, if they are not encrypted, sensitive information can become
readily available to hackers. In the early days of email communication, most
messages were sent as plaintext, which meant hackers could easily read their
content. Fortunately, most email providers now provide encryption by default.
There are three parts to email encryption.
1 The first is to encrypt the actual connection from the email provider, because
this prevents hackers intercepting and acquiring login details and reading any
messages sent (or received) as they leave (or arrive at) the email provider’s server.
2 Then, messages should be encrypted before sending them so that even if a
hacker intercepts the message, they will not be able to understand it. They
could still delete it on interception, but this is unlikely.
3 Finally, since hackers could bypass your computer’s security settings, it is
important to encrypt all your saved or archived messages.
Asymmetric encryption is the preferred method of email encryption. The email
sender uses the public key to encrypt the message while the recipient uses the
private key to decrypt it. It is considered good practice to encrypt all email
messages. If only the ones that are considered to be important are encrypted,
the hacker will know which emails contain sensitive data and will therefore
spend more time and energy trying to decode those particular ones. Encryption
only scrambles the message contents, not the sender’s email address, making it
very difficult to send messages anonymously.
Most types of email encryption require the sending of some form of digital
certificates to prove authenticity. The management of digital certificates, though
time-consuming, is crucial, as users would not want them to fall into the hands
of hackers.
Encryption in HTTPS websites
HTTP (Hypertext Transfer Protocol) is the basic protocol used by web browsers
and web servers. Unfortunately, it is not encrypted and so can cause internet
traffic to be intercepted, read and understood. Hackers could intercept any private
information including bank details and then use these to commit fraud. HTTPS
(Hypertext Transfer Protocol Secure), however, enables users to browse the world
wide web securely. To do this, it uses the HTTP protocol but with SSL/TLS
encryption overlaid. HTTPS websites have a digital certificate issued by a trusted
CA, which means that users know the website is who it says it is; as we mentioned
in the section on SSL/TLS, it proves it is authentic. It again ensures the integrity of
the data by showing that web pages have not been changed by a hacker while being
transferred and by encrypting any information transferred from the client to the
server, and vice versa. As a result of the use of HTTPS, users can securely transmit
confidential information such as credit card numbers, social security numbers and
login credentials over the internet. If they used an ordinary HTTP website, data is
sent as plain text, which could easily be intercepted by a hacker or fraudster.
Indicators that you are using a secure site are the https:// prefix at the
start of the URL and a padlock icon next to the
URL. Depending on your browser and the type of certificate the website has
installed, the padlock may be green. The way HTTPS works is that the web browser
on the client computer performs a handshake with the web server, as described
earlier in the section on SSL/TLS. Then, what is sometimes called a session key is
created randomly by the web browser and is encrypted using the public key, then
sent to the server. The server then decrypts the session key using its private key.
All data sent between the two from then on is encrypted using this session key.
So, the generation of the session key is done through asymmetric encryption, but
symmetric encryption is used to encrypt all further communications. Asymmetric
encryption requires a lot of processing power and time, so HTTPS uses a
combination of asymmetric and symmetric encryption. Once the session is finished,
the client and the server each discard the symmetric key used for that session, so each
separate session requires a new session key to be created.
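The sketch below is not the real TLS handshake, but it illustrates the hybrid idea just described, using the same cryptography package as before: a random session key is protected with asymmetric encryption, and the rest of the traffic is encrypted symmetrically with it.

from cryptography.fernet import Fernet
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

# Server: an RSA key pair; the public key would be published in its certificate.
server_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# Client: create a random session key and send it encrypted with the public key.
session_key = Fernet.generate_key()
wrapped_key = server_private.public_key().encrypt(session_key, oaep)

# Server: recover the session key; both sides now use fast symmetric encryption.
recovered_key = server_private.decrypt(wrapped_key, oaep)
message = Fernet(session_key).encrypt(b"example request containing card details")
print(Fernet(recovered_key).decrypt(message))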
One of the benefits of HTTPS, obviously, is security of data, with information
remaining confidential because only the client browser and the server can
decrypt it. Another benefit of HTTPS is that search engine results tend to rank
HTTPS sites higher than HTTP sites. However, the time required to load an
HTTPS website tends to be greater. Websites have to ensure that their SSL
certificate has not expired and this creates extra work for the host as it has to
keep on top of certificate management.
1.3.5 Advantages and disadvantages of different protocols
and methods of encryption
There are a number of advantages of using encryption. As was said earlier, if
personal information is sent across the internet, whether it is credit card information
or personal details, once it is encrypted the information can no longer be understood
or used for purposes such as identity theft or cyber-fraud, or held to ransom. Company
secrets would not be able to be sold by hackers to rival companies. One drawback
with encryption, however, is that it takes more time to load encrypted data as well
as requiring additional processing power. When browsing, the client and server
must send messages to each other several times before any data is transmitted. This
increases the time it takes to load a webpage by several milliseconds. It also uses up
valuable memory for both the client and the server. Encryption involves the use of
keys and while a larger key size means more effective encryption, it also increases
the computational power required to perform the encryption. Encryption is meant
to protect data, but it can also be the source of great inconvenience to a computer
user.
Ransomware can be used against individual computer users; hackers can
encrypt computers and servers and then demand a ransom. If the ransom is paid,
they provide a key to decrypt the encrypted data. Another problem with encryption
is that if the private key is lost, it is extremely difficult to recover the data and, in
certain circumstances, the data may well be lost permanently. It is possible for the
data to be recovered by the reissuing of the digital certificate, but this can take time
and, in the meantime, if a hacker has managed to get hold of the key, they will
have full access to the encrypted data. In addition, users can get careless and forget
that decrypted data should not be left in the decrypted state for too long, as it then
becomes susceptible to further attack from hackers.
Regarding the different forms of encryption, asymmetric is much slower compared
to symmetric due to its mathematical complexity, and it is therefore not suitable for
encrypting vast amounts of data. It also requires greater computational power.
While it is difficult to give the advantages and disadvantages of individual
protocols since they all do a similar job, it is possible to compare the advantages
and disadvantages of SSL/TLS with IPsec for setting up a VPN, for example.
With SSL/TLS, digital certificates are only essential with the server (client
digital certificates are optional), whereas with IPsec both client and server have
to be authenticated, which makes it more difficult to manage an IPsec system.
However, as user authentication is optional, this means that security is weakened
with SSL/TLS compared to IPsec.
Extra software has to be downloaded when using SSL/TLS if non-web-based
applications are used, which may be a problem if a
firewall prevents or slows
down access to these downloads. VPN tunnels using SSL/TLS are not supported
by certain operating systems which do support IPsec. In conclusion, the time-
consuming management of digital certificates is less of a problem with SSL/TLS
than with IPsec, which could lead to a saving of money. Another cost implication
is that, unlike most uses of IPsec, you do not need to buy client software and the
process of setting up and managing such a system tends to be easier.
Activity 1d
1 Briefly describe what is meant by symmetric encryption.
2 Write a sentence about each of the three uses of encryption.
3 Explain the difference between HTTP and HTTPS.
1.4 Checking the accuracy of data
1.4.1 Validation and verification
It is vital that data is input accurately to ensure that the processing of that data
produces accurate results. When using a computer system, data entry is probably
the most time-consuming part of data processing. Consequently, it is important
to try and ensure that the number of errors which occur when entering data
directly or transferring from another medium is very small, otherwise more
time will need to be spent correcting the data or even re-entering it. To try
and ensure that data entry is accurate, we use two methods called validation
and verification. Neither method ensures that the data being entered is correct
or is the value that the user intended to enter. Verification tries to ensure that
the actual process of data entry is accurate, whereas validation ensures the
data values entered are reasonable. It may well be that the original data when
collected was incorrect; here we are just ensuring that no further errors occur
during the transfer process.
Methods of validation
As stated above, data validation simply ensures that data is reasonable and sensible.
In the early days of computerisation there were some horror stories of people
getting utility bills for 1 million dollars! This was usually the result of poor
checking being carried out and nobody noticing that the decimal point had been
put in the wrong place. A validation check that ensured nobody could have a bill
of more than $1000 would have stopped this. However, it would not prevent
somebody getting an incorrect bill of $321 instead of $231, as the amount being
charged would still be regarded as sensible or reasonable. To emphasise then,
validation ensures data is reasonable or sensible but not necessarily correct.
In order to ensure that data input to a system is valid, it is essential to
incorporate as many validation routines as possible. The number and methods of
validation routines or checks will obviously depend on the form or type of input
to the system. Not every
field can have a validation check. For example, there
are so many variations of an individual’s name that this would be very difficult
to validate. Some have letters that might not be recognised by a computer using
an alphabet based on the English language and some contain punctuation
marks such as apostrophes and hyphens.
In order to illustrate the types of validation, let us consider a school library
database which has one table for books and another table for upper-school
students/borrowers. Let us look at an extract showing some of the typical
books and borrowers from the complete database. We shall assume that these
records are representative of the whole database.
ISBN Title Author Published Cost
9781474606189 The Labyrinth of the Spirits Carlos Ruiz Zafón 2016 $10
9780751572858 Lethal White Robert Galbraith 2018 $18
9781780899329 18th Abduction James Patterson 2019 $29
9781408711095 The Colours of All the Cattle Alexander McCall Smith 1997 $26
Table 1.2 Books database table
Borrower_ID Name Date_of_birth Class Book_borrowed
0205 Chew Ming 21/12/04 11D 9781474606189
1016 Gurvinder Sidhu 19/11/05 10A 9781408711095
0628 Gary Goh 18/04/06 10C 9781474606189
1014 Jasmine Reeves 13/02/05 11A 9780751572858
Table 1.3 Years 10 and 11 upper-school borrowers database table
Presence check
When important data has to be included in a database and must not be left out,
we use a presence check to make sure data has been entered in certain fields.
A common mistake made by many students when asked which validation check
should be used on a field in a database, is to always say ‘presence check’. It
is the easiest to remember but is rarely used on any field except the key field.
With many of the fields illustrated above, it is not necessary to use a presence
check. The data can be updated or entered at a later date. We would, however,
probably use a presence check on the ISBN field in the Books table and on the
Borrower_ID field in the Borrowers table. These are key fields and, if their data
were missing, it would be very difficult to identify unique records.
You may well have come across this when completing online data-
capture forms. Fields sometimes have a red asterisk next to them, along with a
message instructing users that fields marked with an asterisk must be completed.
Presence checks are frequently used to check that you have entered data into
these fields and, if you have not, then there is usually a warning message saying
you must enter the data before you can move on to the next page. While the
presence check prevents you from missing out a field or record, it is otherwise a
fairly inefficient method, as it does not prevent you from entering incorrect or
unreasonable data.
Range check
Fields which contain numeric data tend to have
range checks designed for them.
Using the extract from the Books table above, we can see that the maximum
cost of a book is $29 and the minimum cost is $10. We can make sure that the
person entering the data does not enter a cost less than $10 or more than $29
by using a range check. A range check always has an upper value and a lower
value which form the range of acceptable values. If a value is entered that is
outside this range, an error message is produced. We will deal with the limit
validation check later in this section. Suffice to say at this point that a
limit
check
only has one boundary, which is either upper or lower.
So, to sum up, a range check checks that the data is within a specified range of
values. In our example, the cost of a book should lie between 10 and 29. A range
check could be carried out on each part of the Date_of_birth field in the Borrowers
table to ensure that 32 or 0 was not entered in the day part or that 13 or 0 was not
entered for the month part. However, this would not prevent impossible dates such
as 31/2/05 being entered (February can only have day values of 29 or less).
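To illustrate the idea, here is a minimal sketch in Python of how a range check on the Cost field might be written. It is only an illustration and not part of any particular database package; the boundary values 10 and 29 simply come from our sample Books table.

def range_check(value, minimum=10, maximum=29):
    # Accept the value only if it lies within the inclusive range
    return minimum <= value <= maximum

print(range_check(18))  # True  - a sensible cost, so it is accepted
print(range_check(35))  # False - outside the range, so an error message would be produced

A limit check, which we meet later in this section, would simply drop one of the two boundaries.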
Type check
A
type check ensures that data is of a particular data type. In the above example,
Borrower_ID in the Borrowers table could be set up so that a validation rule would
ensure that every character was
numeric. Most students think that you only have
to set the field data type to numeric to prevent text from being entered. While
this is true, it would not be sufficient with the Borrower_ID field, as any leading
zeros, such as with 0205, would be removed in such a field. This would also not be
acceptable in most parts of the world in fields containing telephone numbers. The
data type of Borrower_ID would have to be set to
alphanumeric and a validation
routine would need to be created to only allow the digits 0 to 9 to be entered.
Fields in some databases would contain letters of the alphabet only. Setting the field
type to alphanumeric does not prevent numbers from being entered. A separate
validation routine would need to be set up. A type check can be performed on
most fields to make sure that no invalid characters are entered. For that reason, it is
often referred to as an invalid character check, or by its shortened name, character
check. One of the shortcomings of this check is that it would not alert you if you
had not typed in the correct number of characters in a particular field.
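As a rough illustration only, a character check on the Borrower_ID field could be written along the following lines in Python, assuming the field is stored as text so that leading zeros are kept:

def character_check(borrower_id):
    # Accept the ID only if every character is one of the digits 0 to 9
    return borrower_id.isdigit()

print(character_check("0205"))  # True  - all digits, and the leading zero is preserved
print(character_check("02A5"))  # False - contains a letter, so an error message would be produced

Note that, exactly as described above, this check says nothing about whether the correct number of characters has been typed in.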
Length check
A
length check is performed on alphanumeric fields to ensure they have the
correct number of characters, but it is not used with numeric fields. Generally,
it is carried out on fields that have a fixed number of characters, for example the
length of a telephone number in France tends to be 10 digits. As the leading
digit is 0, the phone numbers would be stored in the alphanumeric data type, so
it is fairly straightforward to apply a length check.
Again, the application of this type of validation check is not to be confused with
setting up the data structure of a file so that the field length is a fixed number
of characters, as this does not prevent a user from typing in a string of less than
that fixed number of characters. In this instance no error message would be
produced, but with a length check, if you did not type in a correct number of
characters, an error message would result.
In our database, we would apply a length check to the ISBN field in the Books
table. All ISBNs in our table are 13 characters long, so we need to have a check
to make sure we are not typing in fewer or more than 13 characters. We could
also apply this type of check to the Borrower_ID field in the Borrowers table
as these all appear to be four characters in length. This would not give us full
validation on the field, however, as letters of the alphabet, if entered, would not
be flagged up as an error, as the check only counts the number of characters.
It is also worth noting that length checks can be set to a range of lengths. For
example, phone numbers in Ireland tend to vary between 8 and 10 characters in
length. In that instance, you would need a routine that would not allow you to
type in fewer than eight characters or more than ten characters. This is not to be
confused with the range check described above.
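A length check is equally easy to sketch. The Python routine below is only an illustration; the example telephone number is invented, and the allowed lengths are passed in as parameters so that either a fixed length (the ISBN) or a range of lengths (the Irish phone numbers) can be checked.

def length_check(value, minimum, maximum):
    # Accept the value only if its number of characters lies between the two limits
    return minimum <= len(value) <= maximum

print(length_check("9781474606189", 13, 13))  # True  - exactly 13 characters
print(length_check("978147460618", 13, 13))   # False - only 12 characters were typed
print(length_check("012345678", 8, 10))       # True  - an invented 9-character phone number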
Format check
New vehicle registration or licence plates in the UK follow an identical pattern.
The format is two letters followed by two digits followed by a space and then
three letters, for example, XX21 YYY. Any new registration number when
entered into a database can therefore be validated using a
format check or
picture check. If a combination of characters is entered which does not follow
this pattern, it should produce an error message.
In a format check, you can specify any specific combination of characters that
must be followed. In our example database, we could use a format or picture
check on the Class field in the Borrowers table. We could set it so that it must
be two digits followed by one letter. If somebody did not realise that it was the
Upper-school borrowers table and typed in 9B, for example, this would cause
an error message to be output saying that the field must have two numbers
followed by a letter. While a format check is very useful in this scenario, it would
not prevent somebody mistyping an entry, and entering 19A for example, by
mistake, as this would be accepted by the system. This form of validation is very
useful for checking dates if they are in a specific format, such as Date_of_birth
in the Borrowers table, which consists of two digits followed by a slash, followed
by two digits followed by a slash, followed by two digits.
Many students mistakenly think that you can put a format check on a
currency
field. This is not the case. The data is stored as a number and the currency symbol is
added by the software. The user is only entering numbers, not the currency symbol.
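A format or picture check is usually implemented with a pattern of some kind. The sketch below uses Python regular expressions purely as an illustration; the two patterns describe the UK registration plate format and the Class field from our Borrowers table.

import re

PLATE_PATTERN = re.compile(r"[A-Z]{2}[0-9]{2} [A-Z]{3}")  # e.g. XX21 YYY
CLASS_PATTERN = re.compile(r"[0-9]{2}[A-Z]")              # e.g. 11D

def format_check(value, pattern):
    # Accept the value only if the whole string matches the required picture
    return pattern.fullmatch(value) is not None

print(format_check("XX21 YYY", PLATE_PATTERN))  # True
print(format_check("9B", CLASS_PATTERN))        # False - only one digit before the letter
print(format_check("19A", CLASS_PATTERN))       # True  - right format, even though class 19A does not exist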
Check digit
For our purposes we will consider the use of a
check digit as a means of
validating data as it is being input (some people would argue that it can be
used as a verification check in certain other circumstances, but that is outside
the scope of this book). It is used on numerical data, which is often stored as a
string of alphanumeric data-type. For example, the last digit of an ISBN for a
book is a check digit calculated using simple arithmetic. There are a number of
ways of calculating the check digit but one of the most frequently used methods
is described here. Using the first 12 digits in the ISBN, each individual digit
in the string is multiplied by 1 if it is in an odd-numbered position (such as
1st, 3rd and so on) or 3 if it is in an even-numbered position (such as 2nd, 4th
and so on). The resulting numbers are added together and divided by 10. If
the remainder is 0 that becomes the check digit, otherwise the remainder is
subtracted from 10 and that becomes the check digit. This is then added to the
end of the string, in this case as the 13th digit of the ISBN.
This happens at a stage before the data is entered, for example when the ISBN is
allocated before a book is published. When this data comes to be entered into a
database in a library, the computer recalculates the check digit to check whether it
gives the same check digit. If it does not, then an error message is produced. This
usually happens when the person entering the data has transposed two digits. In
our example if we typed in 9781447606189 instead of 9781474606189, the check
digit would be recalculated from the first 12 digits and would produce the check
digit 5 instead of 9. This would produce an error message.
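The check digit calculation described above can be written as a short routine. The following Python sketch is one way of doing it and reproduces the two results just mentioned:

def isbn13_check_digit(first_12_digits):
    # Weight odd positions (1st, 3rd, ...) by 1 and even positions (2nd, 4th, ...) by 3
    total = sum(int(digit) * (1 if position % 2 == 0 else 3)
                for position, digit in enumerate(first_12_digits))
    remainder = total % 10
    return 0 if remainder == 0 else 10 - remainder

def isbn13_is_valid(isbn):
    # Recalculate the check digit and compare it with the 13th digit entered
    return isbn13_check_digit(isbn[:12]) == int(isbn[12])

print(isbn13_is_valid("9781474606189"))  # True  - recalculated check digit is 9
print(isbn13_is_valid("9781447606189"))  # False - transposed digits give a check digit of 5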
Lookup check
A lookup check compares the data that has been entered with a limited number
of valid entries. If it matches one of these then it is allowed, but if it does not then
an error message is produced. It can only be used efficiently if there are a limited
number of lookup values, such as the days of the week, where there are only seven.
A lookup check is not to be confused with setting up a lookup list for the user to
select data items from when entering data. A lookup list does not produce error
messages and in certain circumstances users can still enter data which overwrites
the given list. In our database, we could use the Class field in the Borrowers table,
which would probably only have the classes 10A, 10B, 10C, 10D, 11A, 11B, 11C,
11D to choose from. If, using the example above, somebody entered 9B, the
computer would compare this to the values stored in a separate table and would
not be able to find a match and would produce an error message.
Consistency check
This is sometimes called an integrity check, but for our purposes, so that we
do not get confused with referential integrity (which we will meet later in the
book), we will refer to it as a
consistency check. It checks that data across
two fields is consistent. A good example would be to ensure data consistency
between the Class field in the Borrowers table and the Date_of_birth field. We
will assume that each student is allocated to a class according to their age when
they join the school. We will assume that, in our example, students in year 11
were born between 1 September 2004 and 31 August 2005. A consistency
check can be applied here so that if the first two digits of the class are 11 then
the student's Date_of_birth must be between 01/09/04 and 31/08/05. If this
is not the case, then this validation check would output an error message. It is
often applied so that a field that contains a person's age must be consistent with
a field that contains their date of birth, though it should be stressed that storing
the age of a person is considered to be bad practice as it changes regularly and
needs updating often.
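A consistency check compares two fields in the same record. The Python sketch below, again only an illustration, checks the Year 11 rule described above; other year groups would need their own date ranges.

from datetime import date

def consistency_check(class_code, date_of_birth):
    # For a Year 11 class the date of birth must fall in the stated range
    if class_code.startswith("11"):
        return date(2004, 9, 1) <= date_of_birth <= date(2005, 8, 31)
    return True  # other year groups are not checked in this simple sketch

print(consistency_check("11D", date(2004, 12, 21)))  # True  - consistent with Year 11
print(consistency_check("11A", date(2006, 4, 18)))   # False - born too late for Year 11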
Limit check
We have already considered the range check; now we will look at a limit check.
A limit check is similar to a range check, but the check is only applied to one
boundary. For example, in the UK you are only allowed to drive from the age
of 17, but there is no upper limit. If somebody enters a number lower than 17
when asked to enter their age when applying for a driving licence, for example,
this will generate an error message. In our database it is difficult to apply a limit
check given the data provided.
The need for both validation and verification
The two methods of checking the accuracy of data are complementary. We have
seen that verification can report on errors that validation cannot and, similarly,
validation will pick up errors that verification cannot. Both are needed to ensure
that data is sensible and reasonable in the first place and also transferred accurately.
The difference between validation and verification
As we saw earlier both these methods are essential to ensuring the entry of data is
accurate. It is important to emphasise that we are checking that the data was entered
accurately; we are not checking that the data itself is accurate. There are a number
of differences between verification and validation. Validation is always carried
out by a computer whereas verification can be carried out by a computer or by a
human. Validation is checking that the data is reasonable and sensible. Verification
is checking that the data has been copied or entered correctly but cannot tell
you whether the data is sensible or not. Similarly, validation does not help if you
have copied the data incorrectly. If you type in FD236CS instead of DF236CS, a
format check would accept this as valid input (as it is still two letters followed by
three
numbers followed by two letters) even though it has been copied incorrectly.
Verification would alert the user to this error. Data may have been invalid when
collected but verification only helps you to know that the data has been transferred
accurately to another medium. It does not help if the original data is incorrect.
Consider an electricity company which employs meter readers to read customers’
electricity meters. Suppose the meter reader accidentally writes down that, for
one customer, the number of units used was 4866 instead of the actual reading,
which was 4860. When the readings for all the customers have been collected,
they are entered into the computer, including the incorrect reading of 4866.
At this stage all verification would do is check that the number entered was the
same as that in the source document, 4866, so incorrect data would pass the
verification test. The validation check might be that readings must be between
2000 and 6000. Again, incorrect data would pass this test as well. This shows
how important it is that the correct data is collected in the first place, since
verification and validation might still allow the data to pass through the system
undetected. Verification is a way of ensuring that the user does not make a
mistake when inputting data whereas validation is checking that the data input
conforms with what the system considers to be sensible and reasonable.
Verification
As already stated, verification simply ensures that data has either been entered
accurately by a human or that it has been transferred accurately from one storage
medium to another. There are a number of methods of verification, some related
to manual entry of data and some related to data transfer.
Visual checking
Visual checking is carried out by the person who enters the data, who visually
compares the data they have entered with that on the source document. They
can see the differences and then correct the mistakes. This is the simplest form of
verification and can be done by reading the data on the screen to make sure it is the
same as the source document. An alternative method is to print out the data entered
and compare the printout side by side with the source document. Visual checking
can be rather time-consuming and possibly costly as a result. Another problem is
that the person who is checking that the data has been entered correctly may be the
same person who entered it. It is very easy for them to overlook their own mistakes.
A possible way around this is to get somebody else to do the check.
Double data entry
Double data entry, as the name suggests, involves the entry of data twice. The
first version is stored. The second entry is compared to the first by a computer,
and the person entering the data is alerted by the computer to any differences.
The user then checks to see if the second attempt is correct, corrects the error if
necessary, and continues entering the data.
The alternative to this way of entering data twice is for two different people
to enter the data, which is temporarily saved to the same hard disk. The
computer compares the two versions on the disk and alerts both operators to
any differences, which are then checked to see which version is correct. Some
systems cause the keyboards to freeze so that the people entering the data
cannot continue until the mistake is corrected.
Double data entry is similar to visual verification in that both methods compare
two versions of data and check that data is copied accurately, not checking
that the data collected in the first place was accurate or correct. The essential
difference is that visual verification is carried out by the user, whereas with
double data entry it is the computer that compares the two versions.
Parity check
As was mentioned in Section 1.1.1 the computer stores data in the form of
bits. Each string of bits is called a byte, with a byte normally consisting of
8 bits. Each bit is either 1 or 0, for example 10001101. A byte represents a
number between 0 and 255. Most computers use the American Standard Code
for Information Interchange (ASCII). It is a code which uses numbers
to represent 96 English-language characters, with each character being given
a number between 32 and 127. The first 32 codes (0–31) in ASCII are
unprintable control codes and are used to control peripherals such as printers.
For example, 0 represents the null character, 8 is equivalent to the backspace key
on the keyboard and 13 is equivalent to the return key. The codes 32–127
represent letters of the alphabet, numbers, and other symbols such as $, %, &.
So, for example, the ASCII code for uppercase I is 73, which is 01001001 in binary.
If we wanted to represent the word BROWN, we would use the ASCII codes
66, 82, 79, 87 and 78, which would in turn be represented by the following
bytes: 0100001001010010010011110101011101001110
In the early days of computers, only 7 bits of each byte were used to hold the
information, with the extra bit used for a parity check. It soon became apparent
that extra characters needed to be represented in the system, such as the Spanish ñ,
the French è and the symbol ©. These form what is called extended ASCII and
are represented by the codes 128–255, providing 128 additional characters.
With all 8 bits of the byte now used for data, the parity bit is added as a ninth bit.
It is still possible to use just seven bits of the byte to represent data and have the
eighth as a parity bit, but this would give us a limited set of characters to work
with. In this section, we will only be considering 9-bit parity checking.
Most computers use ASCII codes to represent text, which makes it possible to
transfer data from one computer to another. There has to be, however, a way to
check that data has been transmitted accurately. This is called parity checking,
which involves the use of parity bits. The parity bit is added to every byte (8
bits) that is transmitted. The parity bit is added to the end of the byte so that
there are an even number of 1s. Some systems use an odd number of 1s but we
will be looking at the most commonly used, which is even parity.
When data is being transmitted from one device to another, the sending device
counts the number of 1s in each byte. If the number of 1s is even, it sets the
parity bit to 0 and adds this on to the end of the byte. However, if the number
of 1s is odd, it sets the parity bit to 1 and adds it on. The result is that every
byte of transmitted data consists of an even number of 1s. When the other
device receives the data, it checks each byte to make sure that it has an even
number of 1s. If there is an odd number of 1s, then this means there has been
an error during the transfer of data. So how does this all work?
Consider the example given above: the word BROWN
B – 01000010 there are two 1s (even) so we add 0; it now becomes 010000100
(two 1s [even])
R – 01010010 there are three 1s (odd) so we add 1; it now becomes 010100101
(four 1s [even])
O – 01001111 there are five 1s (odd) so we add 1; it now becomes 010011111
(six 1s [even])
W – 01010111 there are five 1s (odd) so we add 1; it now becomes 010101111
(six 1s [even])
N – 01001110 there are four 1s (even) so we add 0; it now becomes 010011100
(four 1s [even])
This is a very effective verification method. It makes sure that all bytes have
an even number of 1s. If a 1 within the byte is transmitted as a 0, then the
error will be trapped by the system. However, there are still errors which can
go undetected. If two 1s within the byte get transmitted as 0s, the byte will
still have an even number of 1s and so the system will not report an error. For
example, if 010101111 (W with parity bit added) is transmitted as 010001101
(F with parity bit added) the parity check will not notice this, as there is still
an even number of 1s. Also, if, somehow, a 1 and a 0 are transposed, such as
010000100 (B with a parity bit added) being transmitted as 010000010 (A with
a parity bit added), again the parity check will not report an error as there is still
an even number of 1s. More complex error-checking methods have had to be
developed, but parity checking is still very common because it is such a simple
method for detecting errors.
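The whole process can be sketched in a few lines of Python. This is purely an illustration of even parity on 8-bit bytes written out as strings of 0s and 1s:

def add_even_parity(byte_string):
    # Append 0 if the count of 1s is already even, otherwise append 1
    parity_bit = "0" if byte_string.count("1") % 2 == 0 else "1"
    return byte_string + parity_bit

def parity_is_valid(nine_bits):
    # The receiving device simply checks that the count of 1s is still even
    return nine_bits.count("1") % 2 == 0

for letter, byte in [("B", "01000010"), ("R", "01010010"), ("O", "01001111"),
                     ("W", "01010111"), ("N", "01001110")]:
    print(letter, add_even_parity(byte))

print(parity_is_valid("010101110"))  # False - one bit of W was corrupted, so the error is detected
print(parity_is_valid("010001101"))  # True  - two bits changed, so the error slips through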
Checksum
Checksums are a follow-on from the use of parity checks in that they are used
to check that data has been transmitted accurately from one device to another.
A checksum is used for whole files of data, as opposed to a parity check which
is performed byte by byte. They are used when data is transmitted, whether
it be from one computer to another in a network, or across the internet, in an
attempt to ensure that the file which has been received is exactly the same as the
file which was sent.
A checksum can be calculated in many different ways, using different
algorithms, for example a simple checksum could simply be the number of bytes
in a file. Just as we saw with the problem with transposition of bits deceiving a
parity check, this type of checksum would not be able to notice if two or more
bytes were swapped; the data would be different, but the checksum would
be the same. Sometimes, encryption algorithms are used to verify data; the
checksum is calculated using an algorithm called a hash function (not to be
confused with a
hash total, which we will be looking at next) and is transmitted
at the end of the file. The receiving device recalculates the checksum, and then
compares it to the one it received, to make sure they are identical.
Two common checksum algorithms are MD5 and SHA-1, but both have been
found to have weaknesses: it is possible for two different files to have the same
calculated checksum. Because of this, the newer SHA-2 and SHA-3 families have
been developed, which are much more reliable.
The actual checksum is produced in hexadecimal format. This is a counting
system that is based on the number 16, whereas we typically count numbers
based on 10. You can see what each hexadecimal value represents in this table.
Hexadecimal: 0 1 2 3 4 5 6 7 8 9 A B C D E F
Base 10: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Table 1.4 Hexadecimal values
MD5 checksums consist of 32 hexadecimal characters, such as
591a23eacc5d55a528e22ec7b99705cc. These are added to the end of the file.
After the file is transmitted, the checksum is recalculated by the receiving device
and compared with the original checksum. If the checksum is different, then the
file has probably been corrupted during transmission and must be sent again.
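In practice, checksums of this kind are rarely calculated by hand. The Python sketch below shows the general idea using the standard hashlib library; the file name orders.dat is just a placeholder, and the example calls are commented out so that the sketch does not depend on any particular file existing.

import hashlib

def file_checksum(path, algorithm="md5"):
    # Read the whole file in blocks and return its checksum as hexadecimal text
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            hasher.update(block)
    return hasher.hexdigest()

# The sender and the receiver both calculate the checksum of the file;
# if the two values differ, the file must be sent again.
# print(file_checksum("orders.dat"))            # 32 hexadecimal characters for MD5
# print(file_checksum("orders.dat", "sha256"))  # a longer, more reliable checksum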
Hash total
This is similar to the previous two methods in that a calculation is performed
using the data before it is sent, then it is recalculated, and if the data has
transmitted successfully with no errors, the result of the calculation will be the
same. However, this time the calculation takes a different form; a hash total is
usually found by adding up all the numbers in a specific field or fields in a file.
It is usually performed on data not normally used in calculations, such as an
employee code number. After the data is transmitted, the hash total is recalculated
and compared with the original value. If it has not been transmitted properly or
data has been lost or corrupted, the totals will be different. Data will have to be
sent again or the data will have to be visually checked to detect the error.
This type of check is normally performed on large files but, for demonstration
purposes, we will just consider a simple example. Sometimes, school
examinations secretaries are asked to do a statistical analysis of exam results.
Here we have a small extract from the data that might have been collected.
Student ID Number of exam passes
4762 6
0153 8
2539 7
4651 3
Table 1.5 Sample data
Normally, the Student ID would be stored as an alphanumeric type, so for the
purpose of a hash total, it would be converted to a number. The hash total check involves
adding all the Student IDs together. In this example it would perform the calculation
4762 + 153 + 2539 + 4651 giving us a hash total of 12105. The data would be
transmitted along with the hash total and then the hash total would be recalculated
and compared with the original to make sure it was the same and that the data had
been transmitted correctly. We would use a hash total here because there is no other
point to adding the Student IDs together. Apart from verification purposes, the
hash total produced is meaningless and is not used for any other purpose.
Control total
A
control total is calculated in exactly the same way as a hash total, but is only
carried out on numeric fields. There is no need to convert alphanumeric data to
numeric. The value produced is a meaningful one which has a use. In our example
above, we can see that it would be useful for the head teacher to know what the
average pass rate was each year. The control total can be used to calculate this
average by dividing it by the number of students. The calculation is 6 + 8 + 7 + 3
giving us a control total of 24. If that is divided by 4, the number of students, we
find that the average number of passes per student is 6. Obviously, the control total
check is usually carried out on much larger volumes of data than our small extract.
The use of a control total is the same as for a hash total in that the control total is
added to the file, the file is transmitted and the control total is recalculated. Just
as with the hash total, if the values are different, it is an indication that the data
has not been transmitted or entered correctly. However, both types of check do
have their shortcomings. If two numbers were transposed, say student 4762 was
entered as having 8 passes and 0153 with 6 passes, this would obviously be an
error but would not be picked up by either a control or hash total check.
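Both totals can be demonstrated with the extract in Table 1.5. The short Python sketch below simply reproduces the arithmetic described above:

records = [("4762", 6), ("0153", 8), ("2539", 7), ("4651", 3)]

# Hash total: the Student IDs are converted to numbers and added; the total is otherwise meaningless
hash_total = sum(int(student_id) for student_id, passes in records)

# Control total: the sum of a genuinely numeric field, which also has a real use
control_total = sum(passes for student_id, passes in records)

print(hash_total)                    # 12105
print(control_total)                 # 24
print(control_total / len(records))  # 6.0 - the average number of passes per student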
Activity 1e
1 Briefly describe what is meant by verification.
2 Write a brief description for each of three different validation checks.
1.5 Data processing
As we saw in Section 1.1, data must be processed so that it can become
information. Data can include personal data, transaction data, sensor data and
much more. Data processing is when data is collected and translated into usable
information. Data processing starts with data in its raw form and translates
it into a more readable format such as diagrams, graphs, and reports. The
processing is required to give data the structure and context necessary so it can
be understood by other computers and then used by employees throughout an
organisation. There are several different methods of data processing, but the
three most popular ones are batch, online, and real-time. We will now consider
each of these in turn.
1.5.1 Batch processing
In business, a transaction occurs when someone buys or sells something, but
in IT the term ‘
transaction’ can mean much more, such as adding, deleting or
changing values in a database.
Batch processing is still used today as it is an effective method of processing
large volumes of data; sometimes millions of transactions are collected over a
period of time. The data is entered and processed all together in one batch by
the computer, which then produces the required results. The processing of
one batch is often called a job. Batch processing allows computers to process
data when computing resources are not being fully utilised, such as overnight,
and requires very little, and often no, human interaction. Examples of batch
processing include payrolls and customer billing systems.
Batch processing provides a number of benefits. One of these is that it allows
a business to process jobs at the times of day when computer resources are
not being used fully, thereby saving the cost of the computer doing very little.
Companies are able to make sure that vital tasks which need immediate attention,
such as online services, can use the computer resources as and when needed.
They can then timetable batch-processing jobs for those times of day when online
processing tasks are fewer. Compared to
real-time processing, batch processing
requires a simpler computer system without complex hardware or software. Once
the system is up and running, it does not need as much maintenance as a real-time
system and, because there is less human interaction, data entry methods used in
batch processing tend to be more accurate. The actual processing, unlike the data
collection, is very fast with batch processing, with several jobs being processed
at the same time. Many utility companies such as water, gas and electricity
companies handle huge amounts of data in their billing systems. The batch
processing of the collected data, and the calculation and printing of the bills, can
be done mainly at night, so that the computers can be used to help control the
actual delivery of the utility during the busy part of the day.
Master and transaction files
In the next section, we will look at how batch processing is involved in
producing a company payroll. First, we need to be sure we understand that
two separate files are involved in this process. Let us consider the payroll
system of a company that pays its workers weekly. One file is called the
master
file
, which contains all the important data that does not change often, such
as name, works number, department, hourly rate, and so on. The other file,
called the transaction file, contains the data that changes each week, such as
hours worked. The master file would already be sorted in order of the key field
so that processing is easier to perform. The transaction file would probably be
in the order that the transactions were collected. Before these two files can
be processed together to produce the payroll, the transaction file needs to be
sorted in the same order as the master file and will also need to be checked for
errors using validation checks.
A master file would be used together with a transaction file in most batch
processing applications, including a computerised customer orders system (which
we will also be looking at shortly). When data is stored in order of a key field,
it often forms what is called a ‘sequential file’ (the data is in a predetermined
sequence). These used to be stored on magnetic tapes, but magnetic disks and
solid state drives are now used much more than magnetic tapes, which tend, in
modern systems, only to be used for backing up systems. The transaction file will
therefore be stored on disk, even though it holds data in sequential order. When
data is searched for in batch processing, each record is looked at one by one until
the computer finds the record it is looking for. This is called
sequential access.
The steps involved in updating a master file using a transaction file
There are occasions when the data in a master file has to be updated, for example
so that a new worker will get paid. We will assume that the changes will happen
on a weekly basis. It is likely that the transaction file would contain any needed
updates as well as the payroll data and there would only be one transaction file,
but to make it simpler to understand we will assume that the updating of the
master file happens first and then the payroll runs immediately afterwards.
The three types of transaction involved in updating a master file in the scenario
we have outlined are when:
» a worker moves to a different department: their record must be amended or
changed
» a worker leaves the company: their record needs to be removed or deleted
» a new worker starts with the company: their record needs to be added.
We can give each type of update a letter. Changing an existing record in the
master file will be C, whereas deleting a record from the master file will be D
and adding a new record to the master file will be A. At the end of each week,
the computer system will process the data stored in the transaction file and
make any changes that are necessary to the master file, thereby producing an
updated master file.
We can demonstrate this, using the following small sample of data.
ID Transaction Employee name Department
2 D Julia Bolero Sales
4 C Nigel Ndlovu Buying
7 D Adrienne Pascal IT
11 A John Ward Stores
12 A Paolo Miserere IT
EOF End of file marker
Table 1.6 Transaction file
In order to update the master file, a new blank file will
be created and act as the new master file. The following
very basic algorithm will be followed.
Advice
This algorithm and the subsequent algorithms in this
chapter are simplified versions of what an algorithm might
look like. More efficient ways of writing these and other
algorithms will be covered in Chapter 4.
We will look at REPEAT…UNTIL in more detail in Chapter 4.
1 First record in the transaction file is read
2 First record in the old master file is read
3 REPEAT
4 IDs are compared
5 IF IDs do not match, old master file record is
written to new master file
6 IF IDs match transaction is carried out
7 IF transaction is D, old master file record
is not written to new master file
8 IF transaction is C, data in transaction file
is written to new master file
9 IF IDs match, next record from transaction file
is read
10 Next record from master file is read
11 UNTIL end of old master file
12 Data in transaction file record is written to new
master file
13 Any remaining records of the transaction file are
written to the master file
ID Employee name Department
1 Jose Fernandez Buying
2 Julia Bolero Sales
3 Louis Cordoba Sales
4 Nigel Ndlovu Stores
5 Bertrand Couture Buying
6 Lionel Sucio Stores
7 Adrienne Pascal IT
8 Gurjit Mandare Stores
9 Iqbal Sadiq IT
10 Tyler Lewis Buying
EOF End of file marker
Table 1.7 Master file
What is happening in this algorithm is that the computer reads the first record in the
transaction file and the first record in the old master file. If the ID does not match,
as it does not in this case (ID is 2 in the transaction file but the ID in the master file
is 1), there is no change necessary. The computer simply writes the old master file
record to the new master file and misses out the next instructions 6 to 9 (IDs do not
match so they can be ignored), then the next record from the master file is read.
After the last record, an End of file marker would be stored. As it has not
been read yet, we cannot be at the end of the file, so the UNTIL statement
tells the algorithm to go back to the REPEAT instruction. It starts again at
the ‘IF IDs do not match’ in instruction 5. The IDs match in this example
because the ID is 2 in the transaction file and the ID in the master file record
is now 2. Instruction 5 is therefore ignored and instruction 6 indicates that
the transaction is carried out and moves on to instruction 7. In this case the
transaction is D so the record has to be deleted, so the old master file record is
not written to the new master file. Instruction 8 is ignored as the transaction is
not C. Instruction 9 causes the next record from the transaction file to be read.
Instruction 10 means the next record from the master file is read. We again meet
the ‘UNTIL end of old master file’ instruction.
As the algorithm has yet to meet the end of file marker, we go back to the REPEAT
which is instruction 3. We are now looking at the second record of the transaction
file (ID 4) and the third record of the old master file (ID 3). If they do not match
(instruction 5), which they do not, the old master file record is written to the new
master file and we jump to instruction 10 and the next record (the fourth, ID 4) of
the old master file is read.
We are not at the end of the file, so the algorithm takes us back to instruction
3 and then on to instruction 4 and then instruction 5: the ‘IF IDs do not match’
instruction. However, this master file record matches with the second record
in the transaction file so the transaction is carried out. It is C, so the master
file record is not copied across, instead the record from the transaction file
(apart from the C) is copied into the new master file. The next record from the
transaction file is read and then the next record from the master file is read.
This carries on until the EOF marker is met in the old master file. The UNTIL
instruction is now true, so the algorithm moves on to instruction 12 so the
current transaction record is written to the master file. Then instruction 13
is followed and the remaining records of the transaction file are added to the
master file, in this case two records.
The steps are carried out as shown, making two assumptions. One is that all
additional records will be at the end of the transaction file. This is usually the
case as new employees would be given the next available ID and the transaction
file would be sorted so these new workers would appear at the end of the file.
The other assumption, which is not really likely in a real situation, is that the
transaction file records will have fields identical to all those in the master file.
To make it easier to follow, we have not included, for example, the rate of pay
field in those records which need to be added, but this field would have to be
included in the master file for every employee.
ID Employee name Department
1 Jose Fernandez Buying
3 Louis Cordoba Sales
4 Nigel Ndlovu Buying
5 Bertrand Couture Buying
6 Lionel Sucio Stores
8 Gurjit Mandare Stores
9 Iqbal Sadiq IT
10 Tyler Lewis Buying
11 John Ward Stores
12 Paolo Miserere IT
EOF End of file marker
Table 1.8 The new master file
Records 2 and 7 have been deleted.
Record 4 has been changed.
Records 11 and 12 have been added.
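The updating algorithm can also be expressed as a short program. The Python sketch below is just one possible interpretation of it: each record is held as a dictionary, both files are assumed to be sorted by ID, and any additions are assumed to come at the end of the transaction file.

def update_master(old_master, transactions):
    # Merge a sorted transaction file (A/C/D updates) into a sorted old master file
    new_master = []
    t = 0
    for record in old_master:
        if t < len(transactions) and transactions[t]["ID"] == record["ID"]:
            if transactions[t]["Transaction"] == "C":
                changed = dict(transactions[t])   # change: copy the transaction data across
                changed.pop("Transaction")
                new_master.append(changed)
            # a D transaction writes nothing, so the record is deleted
            t += 1
        else:
            new_master.append(record)             # no matching transaction: copy unchanged
    for addition in transactions[t:]:             # remaining A transactions are added at the end
        added = dict(addition)
        added.pop("Transaction")
        new_master.append(added)
    return new_master

Running this with the data from Tables 1.6 and 1.7 produces the new master file shown in Table 1.8.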
Activity 1f
Amend the algorithm above so that it would work if there were no records
to be added, that is, if there were no records in the original transaction file
(Table 1.6) after 7, D, Adrienne Pascal, IT.
Use of batch processing in payroll
As mentioned earlier, batch processing is used to calculate wages in a payroll.
Let us look at a typical master file and transaction file which might be used in a
payroll system. We will only consider a very small company but, in real life, it is
not unusual for payroll systems to cater for thousands of employees.
We can assume that the transaction file has been sorted and validated. The
system would have to go through each transaction file record and, using the
hourly rate (Rate) from the matching master file record, calculate that employee's
wages for the week. This would be added to the employee's wages paid so far this
year and replace the Wages_to_date value. We shall assume that the workers pay
no tax and have no other deductions.
1 First record in the transaction file is read
2 First record in the old master file is read
3 REPEAT
4 IDs are compared
5 IF IDs do not match, old master file record is
written to new master file
6 IF IDs match transaction/calculation is carried out
7 Computer calculates the pay, Rate (from master file) multiplied by hours worked (from transaction file)
8 Wages_to_date is updated and record is written to new master file
9 IF IDs match, next record from transaction file
is read
10 Next record from master file is read
11 UNTIL end of transaction file
12 Remaining records of the master file are written to
the new master file
ID Hours_worked
036 40
469 40
578 38
778 40
789 40
EOF End of file
Table 1.9 Transaction file
ID Department Rate ($) Wages_to_date
036 Sales 20 1280
047 Buying 25 1475
165 Buying 25 1525
469 Sales 20 1160
512 Stores 15 825
545 Sales 20 1220
578 IT 30 1860
682 Sales 20 1080
778 IT 30 1920
786 Buying 25 1575
789 IT 30 1830
861 Stores 15 795
EOF End of file
Table 1.10 Master file
The basic outline of the algorithm, to show how the payroll is processed, is
quite similar to the updating algorithm, but the middle part is different:
Notice that because no additional records need to be added to the master file,
we stop the processing when we get to the end of the transaction file.
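For comparison, here is a rough Python version of the payroll run. To keep it short, it looks each transaction up in a dictionary rather than merging the two sorted files record by record as the batch algorithm does, but the calculation it performs is the same.

def run_payroll(master, transactions):
    # Update Wages_to_date in every master file record that has hours in the transaction file
    hours_by_id = {t["ID"]: t["Hours_worked"] for t in transactions}
    new_master = []
    for record in master:
        updated = dict(record)
        if record["ID"] in hours_by_id:
            pay = record["Rate"] * hours_by_id[record["ID"]]
            updated["Wages_to_date"] = record["Wages_to_date"] + pay
        new_master.append(updated)
    return new_master

# For example, employee 036 worked 40 hours at $20 per hour,
# so Wages_to_date rises from 1280 to 2080.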
Activity 1g
Work through this algorithm using the two files shown and write down the new
master file.
Use of batch processing with customer orders
We have already mentioned that customer orders can be dealt with
using batch processing. When a company receives an order from a
customer, it is added to a transaction file. At the end of the day, the
transaction file is used together with a master file to check the items
are in stock and to update the master file with the new number of
items in stock after making deductions arising from the orders. If items
are not in stock, they must be ordered from a supplier. If they are in
stock, then they are added to what is called a picking list which is sent to the
warehouse. The warehouse staff then find the goods that have been ordered and
package them ready for shipment to the customers. The next step is to allocate the
packaged goods to delivery vehicles. At the same time, invoices are produced and
sent to customers for immediate payment or added to their account (if customers
make payments every month). Let us look at how account payments are processed.
The transaction file will contain details of the orders a customer has made in
the last month, together with any payments the customer has made. We will
consider a simple master file which just contains the money owed by each
customer, which is called ‘the balance’.
Again, we can assume that the transaction file is sorted in Cust_no order and
has been validated. The algorithm describing the process would be similar to
the payroll algorithm:
1 First record in the transaction file is read
2 First record in the old master file is read
3 REPEAT
4 Cust_nos are compared
5 IF Cust_nos do not match, old master file record is written to new master file
6 IF Cust_nos match transaction/calculation is carried out
7 Computer calculates New_orders minus Payment_made and subtracts from Balance
8 Balance is updated and record is written to new master file
9 IF Cust_nos match, next record from transaction file is read
10 Next record from master file is read
11 UNTIL end of transaction file
12 Remaining records of the master file are written to
the new master file
Cust_no New_orders Payment_made
219 320 200
451 870 1500
523 190 340
834 520 250
Table 1.11 Transaction file
Cust_no Balance
138 0
187 0
219 0
451 -800
487 -260
523 -340
764 0
802 -920
834 0
869 -540
Table 1.12 Master file
Activity 1h
Work through this algorithm using the two files shown and write down the new
master file.
1.5.2 Online processing
We have seen how batch processing involves gathering data together ready for
processing at a later date. This has an obvious drawback in that the processing
is delayed. In some cases, this is not a problem, for example with applications
such as payroll, which only need to be processed weekly or monthly, or utility
billing systems which are often processed every three months. Some processing,
however, has to be done almost immediately, such as at supermarket checkouts
or interrogating a database for an employee’s details. The original definition
of online processing was that the user was in direct communication with a
central computer. This has now evolved to include any aspect of IT which takes
place over the internet. In this section, we shall be looking at how applications
such as
electronic funds transfer (EFT) and automatic stock control, among
others, take place. One of the differences between batch processing and online
processing, is that in batch processing data is searched using sequential access,
whereas direct access tends to be used in online processing. Direct access is
simply the ability to go straight to the record required without having to read all
the previous records.
Uses of online processing
When data is input into an online system, processing takes place almost
immediately with just a short delay, so short that the user believes they are in
direct communication with the computer. Each transaction is processed before
the next transaction is dealt with. This means that online processing can be
used in a variety of ways. We shall look at some of these here.
Electronic funds transfer
One definition of electronic funds transfer (EFT) is that it is the electronic
transfer of money from one bank account to another using computer-based
systems, without the direct intervention of bank staff. Examples include the
use of an
automated teller machine (ATM), a direct payment of money to
another person, and direct debits, when a company debits the customer’s
bank account for payment for goods or services. EFTs can be transfers
resulting from credit or debit card transactions at a supermarket, a store
or online. They usually involveone bank’s computer communicating with
another bank’s computer, though not always, such as when the ATM being
used belongs to the customer’s bank.
Most people receive their wages as a result of an EFT. Money from the employer’s
bank account is transferred electronically to the employee's bank account.
EFT has become a common way of paying bills. For example, you may decide
that your house needs redecorating, so you ask a painter to come and paint your
house. When the painter has finished, he or she will require payment. One of
the easiest ways of doing this is to take the painter’s bank details and transfer
the money from your account to the painter’s account.
The following steps describe what happens after you have logged in to your
online bank account, although the process may differ slightly from bank to
bank, and assumes you are paying someone new:
1 Select transfer money
2 Select the account you wish to transfer money from
3 Select new payee
4 Type in sort code, account number and payee name
5 Type in amount to transfer
6 Computer checks available balance
7 If you have sufficient funds, the transaction is
authorised
8 Your bank's computer contacts the payee's bank's computer, which searches for the payee's record
9 Amount is subtracted from your account balance
10 Amount is added to payee's account balance
The most common form of electronic funds transfer is purchasing goods in
a store or supermarket when paying at a checkout. This is called EFTPOS
and stands for Electronic Funds Transfer at Point Of Sale. Checkouts at
supermarkets are called point-of-sale terminals. When a customer goes to a
checkout to pay for their goods, they insert their bank card and the following
steps are followed. (This assumes it is not a contactless transaction, in which
case steps 3 to 8 are omitted. Most countries only allow contactless transactions
if the value of the goods is less than a certain amount.)
1 Card chip is read and checked to make sure it is in
date and it is a valid card number
2 If not, card is rejected and transaction terminated
3 PIN is entered by customer into PIN pad
4 Chip reader determines PIN from the chip
5 The two PINs are compared
6 If they are identical the transaction is authorised
7 If they are not identical, error message appears on
the chip reader and two more attempts are allowed
8 If the two PINs are still not identical, the
transaction is rejected and error message issued
9 Customer's bank is contacted by supermarket's computer
10 Customer's bank retrieves customer's record
11 Customer's bank checks if sufficient funds in account
12 If there are insufficient funds then transaction is
rejected
13 If there are sufficient funds then transaction is
authorised
14 The amount of the bill is deducted from customer's account
15 The amount of the bill is credited to supermarket's account
Automatic stock control
There are many ways in which IT can be used in stock control. In this section we
are going to concentrate on automatic stock control. This involves the use of an
automated system where stock is controlled by a computer with little human input.
We have already met the use of EFTPOS terminals. These also serve another
purpose in stores and supermarkets; they are used for stock control. The
checkout operator swipes the barcode of an item and the computer uses this
to update the stock. The terminal in a supermarket or store consists of a
screen (which can be a touchscreen), a barcode reader to input the barcode
of the product, a number pad to enter the barcode in case the barcode label
is damaged, and electronic scales. Each terminal is connected to a computer
network. The hard disk on the network server stores a file (we shall call it the
product file) containing the records of each product that is sold. Each record
consists of different fields containing data, for example:
» barcode number: the number which identifies each different product; this is
the key field because it is different for each product
» product details: a description, such as tin of beans, packet of teabags and so on
» price of the product
» size: weight or volume of the product
» number in stock: the current total of that product in stock; this changes
every time a product is sold or new stock arrives
» re-order level: the number which the computer will use to see if more of
that product needs re-ordering. If the number in stock falls to this level, the
supermarket or store must re-order
» re-order quantity: when the product needs re-ordering, this is the number of
products which are automatically reordered
» supplier number: the identification number of the supplier which will be used
to look for the details on the supplier file.
There is also a file (the supplier file), which contains details of the supplier of
each product, including their contact details. It is more than likely that these
two files would be stored as separate tables in a
relational database.
The processing involved in automatic stock control is as follows:
1 The product's barcode is input from the barcode
reader
2 The computer searches for this barcode number in
the product file and finds it using direct access
3 The number in stock is reduced by one
4 The computer then compares the number in stock
with the re-order level
5 If the number in stock is not equal to the
re-order level then go back to step 1 and repeat
6 If the number in stock is equal to the re-order
level then the computer creates an automatic order
7 It looks up the re-order quantity of that product
8 It looks up the supplier number of that product
9 It searches the supplier file for the record
corresponding to the supplier number found in the
product file
10 It sends the order automatically to the supplier
using the supplier's contact details
11 Go back to step 1 and repeat
When new goods are delivered, the computer automatically updates the product
file by following steps 1 and 2 and then increasing the number in stock by the
re-order quantity.
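The processing above can be pictured as two small routines. In the Python sketch below, ordinary dictionaries stand in for the product and supplier files, and send_order is a placeholder for whatever mechanism actually transmits the order to the supplier; none of these names come from a real stock-control package.

def process_sale(barcode, product_file, supplier_file, send_order):
    # Reduce the stock of the scanned product and re-order automatically if necessary
    product = product_file[barcode]          # direct access using the key field
    product["number_in_stock"] -= 1
    if product["number_in_stock"] == product["re_order_level"]:
        supplier = supplier_file[product["supplier_number"]]
        send_order(supplier["contact_details"], barcode, product["re_order_quantity"])

def receive_delivery(barcode, product_file):
    # When the new goods arrive, the stock rises by the re-order quantity
    product = product_file[barcode]
    product["number_in_stock"] += product["re_order_quantity"]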
Electronic data exchange
Electronic data exchange is often referred to as Electronic Data Interchange
(EDI). For the purposes of this book we will call it EDI, though the terms are
interchangeable. This is a method of exchanging data and documents without
using paper. The documents can take any form such as orders and invoices, with
the electronic exchange between computers using a standard format. An invoice
is a type of bill sent to a customer containing a list of goods sent or services that
have been provided, including a statement of the sum of money due for these.
Most companies create invoices using a computer system. Many then print a paper
copy of the invoice and post it to the customer. If the customer is a business, they
will often type the details into their computer, meaning that the whole activity
of sending and receiving invoices is actually the transfer of information from the
seller’s computer to the customer’s computer. EDI replaces this activity with an
electronic method. The physical exchange of documents could take between three
and five days. EDI often occurs overnight and can take less than an hour.
The old paper-based method was usually made up of these steps:
» A company decides to buy some goods, creates an order and prints it.
» Company posts the order to the supplier.
» Supplier receives the order and enters it into their computer system.
» Company calls supplier to make sure the order has been received, or supplier
posts a letter to the company to say it has received the order.
An EDI system generally has these steps:
» A company decides to buy some goods, creates an order and does not print it.
» EDI software creates an electronic version of the order and sends it
automatically to the supplier.
» Supplier’s computer system receives the order and updates its system.
» Supplier's computer system automatically sends a message back to the
company, confirming receipt of the order.
EDI systems save companies money by providing an alternative to systems that
require humans to operate them, thereby saving wages that would be paid to
people who would sort and search paper documents. There is no need to pay
operators to manually enter the data. EDI also reduces human error during data
entry, since there would be no need to re-enter data that had been sent originally.
Productivity is improved as more documents are processed in less time.
EDI systems are often used because of the security aspects of the system. As
alternative systems using the internet have grown, EDI has had to innovate and
this has been achieved largely by increased security in the transmission of data.
EDI is also used by some examination boards to allow exam entries to be made
and for issuing results. It is also used by hospitals to send and receive documents
to and from doctors, again due to the increased security of this system.
Business-to-business buying and selling
Business-to-business – often termed B2B – refers to buying and selling between
two businesses, rather than between a business and an individual customer
(B2C). The value of B2B transactions is noticeably higher than that of B2C, as
businesses are more likely to buy higher-priced goods and services and to buy
more of them than individual customers are. A car manufacturer, for example,
will buy thousands of tyres, whereas a customer is only likely to buy four tyres at
most when replacements are needed.
Many companies still use EDI for sending orders and invoices, but there are other
aspects of B2B which require online processing. Businesses can buy and sell using
online marketplaces, but many B2B sellers do not take advantage of these. There
is sometimes little difference between B2C and B2B marketplaces, for example
Amazon has B2C and B2B versions of its site. However, you are advised to read
the syllabus regarding the use of brand names. B2B marketplaces work just like
a B2C marketplace in that they connect many sellers to buyers. Buyers have the
opportunity to compare and buy products from many different sellers all on one
site. However, a B2B marketplace is different to a B2C marketplace in that bulk
orders can be placed, discounts can be received for ordering large quantities and
orders can be edited online. Sellers have benefits too, such as lower costs since they
do not have to spend as much money on marketing or setting up a larger website,
as the marketplace is responsible for that part of marketing, although the business
has less control over the look of its products on the website. Sellers can save the time
that would be spent on setting up the sales aspect of a website. The audience is now
global and to a certain extent, a captive audience. Marketplaces can also be used to
test out new products by putting a few up for sale. If they sell well, the volume can
be increased and if not, they can be easily withdrawn.
B2C online transactions are fairly straightforward, whereas B2B transactions
tend to be more complicated. B2B selling prices can vary a great deal with
discounts needing to be taken into account. The quantity of products being
sold is much greater, resulting in more complicated shipping requirements. In
addition, B2B tends to have more government regulation and complex taxation.
However, the failure of companies to invest in online buying and selling can
lead to being left behind by competitors who sell more and make greater profits.
Although EDI is still used by many companies, there are other methods of
buying and selling. Companies can have their own selling website that can be
used by other companies to buy their goods. E-procurement is the term used
to describe the process of obtaining goods and services through the internet.
E-procurement software can be used by sellers and buyers to link directly to
each other’s computer systems.
Online stores
Internet shopping began in the 1990s. At that time, however, the vast majority of
the world's population was not even aware of it. Today, many people shop online, a
development that has arisen due to the emergence of online stores, from both
traditional high-street big names and new online-only companies. Online stores
have become many people's preferred places to shop, for several reasons.
» Customers are not rushed by store assistants into hurriedly comparing
products and prices.
» Customers can shop at a convenient time for them.
» Customers do not have to spend time and money travelling around different
shops to find the best bargains; it is much faster and cheaper to do it online.
» If a local branch of a chain of stores has closed, customers can still shop with
that chain online.
» Customers can look at a wide range of shops all around the world.
» Items are usually cheaper online because warehouse and staff costs are lower
than those of maintaining high-street stores.
» Stores can deliver goods at a time to suit the customer.
» Supermarkets can remember the customer's previous shopping list and favourite
brands, making it quicker for the customer to compile their next order.
» There is a greater choice of manufacturers. Many high-street shops can only
stock items from a few manufacturers.
A typical online store website opens with a home page which contains different
categories, on tabs across the top or down the side or on a drop-down menu.
Customers are able to click on a category tab, which takes them to a different
web page on the site. They can browse the products within that category to get
to the one they want. After opening the store website and browsing product
categories, the customer then decides what they want to buy. At this point,
some stores may ask for a postcode or zip code to check that they actually
deliver to that area. In a real store or supermarket, the customer would place
products in a shopping trolley or basket; similarly, with an online store, they
place them in a virtual shopping basket. This is usually done by clicking on an
‘add to basket’ or simply ‘add’ icon. As with a real store, items can be removed
from, as well as added to, a customer’s shopping basket and, when the customer
has decided that they have finished shopping and they want to pay, they go to
the checkout.
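The virtual shopping basket is essentially a small data structure to which items can be added, from which they can be removed, and whose total can be calculated at the checkout. Below is a minimal sketch in Python of how such a basket could behave; the item names and prices are invented purely for illustration, and a real store would hold this data on its web server rather than in a simple program.

# A minimal sketch of a virtual shopping basket (illustrative only;
# the item names and prices are invented for the example).
class Basket:
    def __init__(self):
        self.items = {}  # maps item name -> (unit price, quantity)

    def add(self, name, price, quantity=1):
        unit_price, held = self.items.get(name, (price, 0))
        self.items[name] = (unit_price, held + quantity)

    def remove(self, name, quantity=1):
        if name in self.items:
            unit_price, held = self.items[name]
            if held <= quantity:
                del self.items[name]      # nothing left, so drop the line entirely
            else:
                self.items[name] = (unit_price, held - quantity)

    def total(self):
        return sum(price * qty for price, qty in self.items.values())

basket = Basket()
basket.add("kettle", 24.99)
basket.add("batteries", 3.50, quantity=2)
basket.remove("batteries")
print(f"Basket total: {basket.total():.2f}")   # 24.99 + 3.50 = 28.49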
Here is an example of the steps, in the form of a simple algorithm, required at
the checkout, although the actual sequence of steps varies depending on the
online store’s website.
1 IF user is new customer, they must register
2 Enter a username
3 Enter and verify (by double entry) password
4 Enter phone number and email address
5 Enter delivery/shipping address
6 Enter billing address if different to shipping address
[some sites only]
7 Choose speed/cost of delivery [some sites only]
8 Enter type of card and credit/debit card number
9 Enter date of expiry
10 IF user is existing customer
11 Log on by entering username and password
12 Confirm delivery address
13 Choose speed/cost of delivery [some sites only]
14 Select credit card/debit card account to be debited if stored,
otherwise enter type of card and credit/debit card number
15 Enter card security code [some sites may not require this]
16 Confirm order
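The checkout steps above can also be written as a short program. The Python sketch below follows the same overall logic (a new customer registers, an existing customer logs on), although it is slightly simplified in that the payment details are asked for once at the end. The prompts are illustrative stand-ins; a real store would use web forms and its own payment back end.

# A simplified sketch of the checkout algorithm above; the prompts are
# illustrative and a real store would use its own forms and systems.
def checkout(is_new_customer):
    if is_new_customer:                                   # new customer: register
        username = input("Choose a username: ")
        password = input("Choose a password: ")
        if input("Re-enter the password: ") != password:  # double-entry verification
            return "Passwords do not match - please register again"
        contact = input("Enter phone number and email address: ")
        delivery_address = input("Enter delivery/shipping address: ")
        billing_address = input("Billing address (blank if the same): ") or delivery_address
    else:                                                 # existing customer: log on
        username = input("Enter your username: ")
        password = input("Enter your password: ")
        delivery_address = input("Confirm your delivery address: ")
    speed = input("Choose speed/cost of delivery: ")
    card = input("Enter type of card and card number: ")
    expiry = input("Enter date of expiry: ")
    security_code = input("Enter card security code: ")
    return "Order confirmed"

print(checkout(is_new_customer=True))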
An email address is nearly always needed, so the store can notify the customer
when the order has been received. Some sites inform you of the progress of the
order’s delivery.
The delivery address is needed so the store knows where the goods will be sent to.
The billing address is required because the store wants to know where to send
the bill, as this is sometimes different to where the goods will be delivered. As
payments are made electronically, this piece of information is largely a formality,
but it can be used for additional credit checks.
The customer is often able to choose how quickly the goods should be delivered
or choose a delivery time slot, if the store has its own delivery vehicles. Some
stores offer same-day or next-day delivery, although usually the quicker the
delivery, the higher the cost.
Often, the customer has to pay a delivery charge as well as the price of the
goods.
1.5.3 Real-time processing
Real-time processing is an example of online processing in that it requires
the inputs to go directly to the CPU of the computer, but the response time
of the computer must be immediate, with no delay whatsoever. This type of
processing is usually found in systems that use sensors, for example
computer-controlled greenhouses, often referred to as glasshouses, which we will be
looking at in more detail in Chapter 3. Temperature, light and moisture sensors
are all used to monitor physical variables and send these as input data to the
computer so that it can take immediate action. For example, if the temperature
falls below a certain value, the heater is automatically switched on. If a
batch-processing system were used, the temperature might have been below the
required value for a long period of time, damaging or even killing the plants
inside, so it is essential that the response from the computer is immediate.
Real-time processing is continuous, so the process is never ending unless the user
switches the system off.
Uses of real-time processing
Real-time systems are systems where the output affects the input. Consider the
glasshouse example above, where the input to the computer or microprocessor
is the temperature. If the temperature is below the required level, the computer
turns the heater on. How this is achieved will be described in greater detail in
Chapter 3. The temperature will now rise and because this process is continuous
the temperature sensor will feed the new temperature back to the computer.
Here, the output, which is the switching on of the heater, has affected the
input, the temperature. This cycle continues until the desired temperature is
reached. Unlike in other online systems, the output is produced quickly enough
that it affects the system before the next input is received. The output happens
immediately. We will now look at three uses of real-time systems.
Central-heating systems
Figure 1.2 A typical central-heating system, showing the boiler, a touchscreen or keypad with a temperature sensor, and the microprocessor
In the system shown in Figure 1.2, there is what is called a combination or
combi boiler. The boiler contains the pump as well as the means of heating
the water. A central-heating system is a real-time system since it involves the
use of sensors; in this case, temperature sensors are used to continuously
monitor the physical variable, the temperature of the house. As
with any real-time system, it involves the use of a feedback loop. This is called a
‘closed system’, in that the temperature is fed back into the system. The boiler heats
the water, which causes the temperature to rise. This, in turn, eventually causes
the boiler to be switched off by the microprocessor, which then results in a drop
in temperature and the boiler has to come back on again and so the sequence is
repeated. We know it is a real-time system since the output affects the input.
In a microprocessor-controlled central-heating system, users select their
required temperature using a touchscreen or keypad. The microprocessor reads
the data from a temperature sensor on the wall and compares it with the value
the user has requested. If it is lower, then the microprocessor switches the boiler
and the pump on. If it is higher, the microprocessor switches both off. In order
for the microprocessor to process the data from the temperature sensor (which
is an analogue sensor), it uses an analogue-to-digital converter to convert this
data into a digital form that it can understand. As a result of the input, the
microprocessor may or may not send signals to
actuators which open the gas
valves in the boiler and/or switch the pump on.
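The analogue-to-digital conversion mentioned above is, at its simplest, a scaling calculation. The Python sketch below assumes a hypothetical 10-bit converter whose readings from 0 to 1023 correspond to a sensor range of 0–50 °C; real sensors and converters differ, so the numbers are purely illustrative.

# Hypothetical example: a 10-bit ADC reading (0-1023) standing for 0-50 degrees C.
ADC_MAX = 1023        # assumed maximum reading of the converter
TEMP_RANGE = 50.0     # assumed span of the sensor in degrees Celsius

def adc_to_celsius(raw_reading):
    # Convert a raw ADC value into a temperature the microprocessor can compare
    # with the user's pre-set value.
    return (raw_reading / ADC_MAX) * TEMP_RANGE

print(adc_to_celsius(430))   # about 21 degrees Celsius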
The microprocessor is also used to control the times at which the system
switches itself on and off. For example, users can set the system to come on
before they get up in the morning and set it to switch off just before they go out
to work. When the system is off, the microprocessor ignores all readings.
A simple algorithm shows how the system works:
1 User enters required temperature using keypad/
touchscreen
2 Microprocessor stores required temperature as a
pre-set value
3 Microprocessor receives temperature from sensor
4 Microprocessor compares temperature from sensor to
pre-set value
5 If temperature from the sensor is lower than the
pre-set value, the microprocessor sends a signal to
an actuator to open the gas valves
6 If temperature from the sensor is lower than the
pre-set value, the microprocessor sends a signal to an
actuator to switch the pump on
7 If temperature is higher than or equal to the pre-set
value, the microprocessor sends a signal to switch
the pump off and close the valves
8 This sequence is repeated until the system is
switched off
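The same control loop can be sketched in Python. The 'sensor' and 'actuators' below are simulated with simple variables so that the loop can actually run; a real central-heating system would read the wall sensor and drive the gas valves and pump instead.

# A simulated sketch of the central-heating control loop described above.
pre_set_value = 20.0        # step 2: the user's required temperature
room_temperature = 17.0     # simulated reading from the wall sensor
heater_on = False

for _ in range(10):         # in a real system this repeats until switched off (step 8)
    current = room_temperature              # step 3: read the sensor
    if current < pre_set_value:             # steps 5 and 6: too cold
        heater_on = True                    # open the gas valves and switch the pump on
    else:                                   # step 7: warm enough
        heater_on = False                   # close the valves and switch the pump off
    room_temperature += 0.5 if heater_on else -0.3   # simulated feedback on the room
    print(f"temperature {current:.1f} C, heater {'on' if heater_on else 'off'}")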
Air-conditioning systems
Air-conditioning systems are more sophisticated than central-heating systems
and involve units such as valves, compressors, condensing units and evaporating
units, which together form a system that, basically, feeds cold air into a room
via a unit containing enclosed fans to circulate it around the room. We shall just
concern ourselves with what happens in one room in a house. Each room has a
temperature sensor and the system uses this to determine whether it needs to
switch the fans on or, in the case of more complex systems, change the speed
of the fans (we will only be considering the simpler version). The user, as with
a central-heating system, enters the required temperature using a keypad or
touchscreen.
Again, we can use a basic algorithm to describe the process:
1 User enters required temperature using keypad/
touchscreen
2 Microprocessor stores required temperature as a
pre-set value
3 Microprocessor receives temperature from sensor
4 Microprocessor compares temperature from sensor to
pre-set value
5 If temperature from the sensor is higher than the
pre-set value, the microprocessor sends a signal to an
actuator to switch the fans on
6 If temperature is lower than or equal to the
pre-set value the microprocessor sends a signal to
an actuator to switch the fans off
7 This sequence is repeated until the system is
switched off
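Only the direction of the comparison changes from the heating example: the fans come on when the sensed temperature is above the pre-set value. A minimal sketch, using the same simulated approach as before:

# Air-conditioning variant: the fans switch on when the sensed temperature
# is ABOVE the pre-set value. The sensor is simulated, as before.
pre_set_value = 22.0
room_temperature = 26.0     # simulated sensor reading
fans_on = False

for _ in range(10):                          # repeated until the system is switched off
    if room_temperature > pre_set_value:     # step 5: too warm
        fans_on = True                       # actuator switches the fans on
    else:                                    # step 6: cool enough
        fans_on = False                      # actuator switches the fans off
    room_temperature += -0.5 if fans_on else 0.2   # simulated feedback
    print(f"temperature {room_temperature:.1f} C, fans {'on' if fans_on else 'off'}")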
Guidance system for rockets
A guidance system can be used to control the movement of different types of
vessel or moving object, such as a ship, aircraft, missile, rocket or satellite. It
involves the process of calculating the changes in position and velocity and
other more complex variables, as well as controlling the object’s course and
speed.
A guidance system, like all computer systems, has inputs, processing, and
outputs. The inputs include data from sensors, the course set by the controller
and data from radio and satellite links. The processing involves using all the
input data to decide what actions, if any, are necessary to maintain or achieve
the required course. The outputs are the actions decided upon as a result of the
processing and use devices such as turbines, fuel pumps and rudders, among
others, to change or maintain the course.
Missiles have a high-precision, real-time guidance system built into their nose.
The guidance system includes a radar subsystem, consisting of one radar that looks
forwards and one that looks downwards. There is also a navigation system and a
‘divert-and-attitude-control’ system, which is able to increase or decrease the
amount of thrust in different engines driving the missile and thereby control
the direction and speed of the missile. Basically, the radar scans the surrounding
terrain and feeds the information into the navigation system, which then uses
the data from the radar together with its stored maps of the area to calculate the
required flight path, avoiding obstacles. The new flight path is immediately fed
into the divert-and-attitude-control system so that it can adjust the direction
and speed of the missile to achieve this flight path. The flight path is constantly
adjusted to ensure the missile does not drift off its projected flight path. A
missile guidance system is an example of a true real-time system, because there
must be no delay between the navigation system realising that the flight path
has been deviated from and the flight path being adjusted.
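Although a real guidance system is far more complex, the feedback idea can be illustrated in a few lines of Python. In the sketch below all of the values and the correction factor are invented: the loop repeatedly compares the measured heading with the required heading and issues a correction, so the output affects the next input.

# A highly simplified sketch of a guidance feedback loop: measure, compare,
# correct, repeat. The headings and the correction factor are invented.
required_heading = 90.0      # degrees: the planned flight path
actual_heading = 97.0        # degrees: as reported by the sensors

for step in range(8):
    error = required_heading - actual_heading   # processing: how far off course?
    correction = 0.5 * error                    # output: thrust/steering adjustment
    actual_heading += correction                # the output affects the next input
    print(f"step {step}: heading {actual_heading:.2f}, error was {error:.2f}")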
1.5.4 Advantages and disadvantages of different methods
of processing
It is important to give both sides of the argument and to take into account the
specific context and the user’s needs when comparing methods.
With real-time processing there is no significant delay in response,
whereas with batch processing the processing can take place well after the initial
inputs have been entered. In real-time processing, information is always up to
date, so the computer or microprocessor is able to take immediate action. With
batch processing, the information is only up to date after the master file has
been updated by the transaction file. Because real-time processing occupies the
CPU constantly it can be very expensive, unlike batch processing which only
uses the CPU at less busy times. Real-time processing needs expensive, complex
computer systems whereas lower specification computers can suffice for batch-
processing systems. Data is collected instantaneously in real-time systems, whereas
gathering the data for a batch-processing system can take a long time and occupy
more people, leading to greater expense in wages.
Comparing real-time systems with online systems, it is easier to maintain and
upgrade online processing systems: computers are less busy overnight, so shutting
down banking or shopping systems for maintenance then is less of a problem, unlike
real-time systems, which have no quiet periods.
Comparing online systems to batch-processing systems, the extra hardware
requirements, input devices, workstations and so on of a large online processing
system can make it more expensive than batch processing. In a company
using an online processing system, each transaction requires the entry of
information immediately. This means that salespeople and other employees
must be connected to the system at all times, unlike batch processing which
only requires a limited number of employees to enter all the data at once. This
leads to savings in wages. An advantage of batch processing compared to online
processing is that it is less expensive than online input as it uses very little
computer processing time to prepare a batch of data or transaction file, and
this can be done at a time convenient to the employees responsible for entering
the data. In online processing, errors are revealed, and can be acted upon,
immediately; in a batch-processing system, an error is only revealed when the
processing takes place. This can be overnight and, as there may be no human
involvement at this stage, management may not be aware of the error until some
time after it actually occurred. Batch processing can be
carried out overnight when the computer would not normally be used. This
means that a company can get more work out of its computer hardware.
Interrogative databases are not well suited to batch processing, as details may
be required at any time. Consider the case where a company uses batch processing for
employee details, payroll and so on. An employee may have an accident and so
a manager needs to contact the employee’s spouse immediately. It would be
pointless running this query overnight. The system must be available whenever
necessary so that the business can function properly.
Most modern business systems use a mixture of batch and online processing,
in order to overcome some of the disadvantages described and benefit from the
advantages.
Examination-style questions
1 A collection of data could be this: johan, Σ, $, ,, AND
  Explain why these are regarded as just items of data. In your explanation give a possible context for each item of data and describe how the items would then become information. [5]
2 A company uses computers to process its payroll, which involves updating a master file.
  a State what processes must happen before the updating can begin. [2]
  b Describe how a master file is updated using a transaction file in a payroll system. You may assume that the only transaction being carried out is the calculation of the weekly pay before tax and other deductions. [6]
3 a Name and describe the purpose of three validation checks other than a presence check. [3]
  b Explain why a presence check is not necessary for all fields. [3]
4 A space agency controls rockets to be sent to the moon.
  Describe how real-time processing would be used by the agency. [5]
5 Describe three different methods used to carry out verification. [3]
6 L12345 is an example of a student identification code.
  Describe two appropriate validation checks which could be applied to this data. [2]
7 Describe three drawbacks of gathering data from direct data sources. [3]