Data Analysis:Notes Data Errors

Notes: Key concepts about data errors

The next step is to manage data. One part of this step is to evaluate data for errors and suitability. We provide some notes on errors to assist you.

It is important to examine the data for errors and also to think about other potential problems that may exist with the data. Only then can you assess the usefulness of conclusions you may draw from the data. There are two main types of errors: missing data and incorrectly recorded data.

Missing data

Missing data occurs in two main ways. There can be blank cells in an existing row of the data, which are easy to see, or entire observations (rows) can be missing, which is difficult to detect.

Blank cells in data

In the current worksheet, this will appear as a blank cell in a priority syndrome investigation event (a blank cell in a row).

This may occur because people forget to record data, or because they have no time to enter it.

An example of missing data in the isikhnas_priority disease syndromes_March_2013.csv worksheet is the Tanggal diinvestigasi column, which contains many missing values. That is, some investigators have failed to record the date that the priority syndrome report was investigated. Another possibility is that no investigation occurred. In the exercises below, we will assume that the investigation did occur but its date was not recorded. The main way a veterinarian can correct these missing cells is to contact the person who originally entered the data and collect the missing values.
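Blank cells like these can also be located with a short script rather than by eye. The following is a minimal sketch in Python, assuming the worksheet has been exported as a CSV file. Only the column name Tanggal diinvestigasi comes from the worksheet; the other column names, the sample rows and the helper function find_blank_cells are invented for illustration.

```python
import csv
import io

# Hypothetical sample rows. Only the "Tanggal diinvestigasi" column name
# comes from the worksheet; the other columns and all values are made up.
SAMPLE_CSV = """ID,Sindrom,Tanggal diinvestigasi
1,Gila galak,2013-03-04
2,Gila galak,
3,Gila galak,2013-03-11
4,Gila galak,
"""


def find_blank_cells(rows, column):
    """Return the 1-based data row numbers whose value in `column` is blank."""
    blank_rows = []
    for row_number, row in enumerate(rows, start=1):
        value = row.get(column) or ""
        if value.strip() == "":
            blank_rows.append(row_number)
    return blank_rows


# With a real file, replace io.StringIO(SAMPLE_CSV) with open(path, newline="").
reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
blank = find_blank_cells(reader, "Tanggal diinvestigasi")
print("Rows with no investigation date:", blank)
```

The list of row numbers tells you which reports to follow up with the person who entered them.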

Missing observations (records)

Entire priority syndrome investigation entries may never be recorded and are thus missing (i.e. entire rows are missing from isikhnas_priority disease syndromes_March_2013.csv).

Unfortunately, this sort of problem is very difficult to detect because the rows are simply not there. If there is a pattern to the missing rows, the missing data will cause problems because selection bias can result. Selection bias means that the collected data does not resemble the population being studied.

We have already seen an example of this above. Recall that no old cattle were included in a sample of district cattle that aimed to estimate the prevalence of brucellosis. As a result, the sample contained only young and healthy cattle, and the sero-prevalence of brucellosis was assessed as 0%. In fact, older cattle in the district were sero-positive to brucellosis, but these cattle were not sampled. The result is missing rows of data in our dataset (i.e. all the old cattle are missing).

We have to think carefully about the method of collecting the data and make an educated guess about whether there will be selection bias because this influences the confidence we have in our results.
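The effect of the missing old cattle in the brucellosis example can be shown with a small calculation. This is a sketch only: the herd size, the ages, the number of sero-positive animals and the age cut-off of 5 years are all invented figures, chosen to show how a sample that excludes old cattle can give 0% when the true prevalence is higher.

```python
# Hypothetical district herd as (age_in_years, seropositive) pairs.
# All figures are invented for illustration.
population = (
    [(2, False)] * 60     # young, sero-negative
    + [(9, True)] * 10    # old, sero-positive
    + [(9, False)] * 30   # old, sero-negative
)


def prevalence(animals):
    """Proportion of animals that are sero-positive."""
    return sum(1 for _, positive in animals if positive) / len(animals)


# A biased sample containing only young cattle (assumed cut-off: under 5 years).
young_only_sample = [animal for animal in population if animal[0] < 5]

true_prevalence = prevalence(population)            # 10 of 100 animals
sample_prevalence = prevalence(young_only_sample)   # 0 of 60 animals
print("True prevalence:", true_prevalence)
print("Prevalence in young-only sample:", sample_prevalence)
```

Nothing in the sampled rows reveals the problem. Only knowledge of how the sample was collected shows that the old cattle are missing.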

Incorrect data (errors)

Random errors

Incorrectly recorded data can occur in most data sets. Fortunately, incorrectly recorded data is less common in iSIKHNAS data than in many other datasets.

This is because extreme values that are not sensible cannot be entered into iSIKHNAS. The data entry methods prevent this. For example, if the weight of a cow is recorded as 5000 kg, that would be an error, since it is 10 times the value we might expect for an adult cow. These sorts of extreme errors are generally prevented during the data entry stage of iSIKHNAS.

However, data that is incorrect but within the normal range can still occur. This is very hard to detect because the data appears to be normal. These random errors add "noise" to a dataset, which reduces the likelihood of detecting real effects when you analyse it. For example, if a cow's true weight was 450 kg but it was recorded incorrectly as 490 kg (due to a typographical error), this would not be detected by iSIKHNAS because it is still a sensible value. The error would be entered into the dataset.
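The difference between the two kinds of mistake can be illustrated with a simple range check. This is a sketch of the general idea, not the actual iSIKHNAS validation rule: the limits of 100 kg and 1200 kg and the function is_plausible_weight are assumptions made for this example.

```python
# Assumed plausible range for adult cattle weight in kg. These limits are
# illustrative only and are not the real iSIKHNAS data entry rules.
MIN_WEIGHT_KG = 100
MAX_WEIGHT_KG = 1200


def is_plausible_weight(weight_kg):
    """Return True if the weight falls inside the assumed plausible range."""
    return MIN_WEIGHT_KG <= weight_kg <= MAX_WEIGHT_KG


# 450 is the true weight, 490 is a typing error, 5000 is an extreme error.
recorded_weights = [450, 490, 5000]
rejected = [w for w in recorded_weights if not is_plausible_weight(w)]
accepted = [w for w in recorded_weights if is_plausible_weight(w)]
print("Rejected at data entry:", rejected)
print("Accepted into the dataset:", accepted)
```

The check stops 5000 kg but accepts 490 kg, because a range check can only judge whether a value is sensible, not whether it is true.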

Systematic errors

Another possible problem with data is that there is a systematic error in the way a value is measured. Here the original value was wrong and this value is recorded "correctly" into iSIKHNAS.

For example, veterinarians in a certain area may be very poorly trained at detecting rabies cases. It is possible that many investigations of Gila galak may have an incorrect provisional diagnosis made (for example intoxication instead of biting and behaviour change due to rabies). These errors are then recorded in iSIKHNAS, but they are wrong because the veterinarian originally made a mistake. Clinical tests can also be wrong in the same way.

This bias can have severe consequences for your interpretation of the data. These errors are also very difficult to detect, and they cannot easily be corrected unless the error rate is known (for example, when the sensitivity and specificity of a diagnostic test are known). These errors produce a bias known as information bias.
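When sensitivity and specificity are known, one standard way to adjust an observed prevalence is the Rogan-Gladen estimator: true prevalence = (apparent prevalence + specificity - 1) / (sensitivity + specificity - 1). These notes do not prescribe a method, so the sketch below shows only one common option. The function name corrected_prevalence and the example values (20% apparent prevalence, 90% sensitivity, 95% specificity) are invented for illustration.

```python
def corrected_prevalence(apparent_prevalence, sensitivity, specificity):
    """Adjust an apparent prevalence for imperfect test accuracy.

    Uses the Rogan-Gladen estimator. All arguments are proportions
    between 0 and 1.
    """
    denominator = sensitivity + specificity - 1
    if denominator <= 0:
        raise ValueError("sensitivity + specificity must be greater than 1")
    estimate = (apparent_prevalence + specificity - 1) / denominator
    # Keep the result inside the valid range for a proportion.
    return min(1.0, max(0.0, estimate))


# Hypothetical values: 20% of animals test positive,
# test sensitivity 90%, test specificity 95%.
estimate = corrected_prevalence(0.20, sensitivity=0.90, specificity=0.95)
print("Corrected prevalence:", round(estimate, 3))
```

With these invented values the corrected estimate is about 17.6%, slightly lower than the apparent 20%, because some of the positive results are expected to be false positives.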

This concludes the notes on errors in data and it is time to recommence the case studies with some exercises on errors.