Epidemiological Data Analysis

Revisi per 25 Juli 2014 15.51 oleh Catriona (bicara | kontrib) (Concluding remarks)

Daftar isi

Data analysis using iSIKHNAS data case studies

Introduction

Pre-requisites

This course is for veterinarians within the Indonesian animal health system. It is assumed that participants will have completed the Excel, Basic Field Epidemiology and Surveillance training modules before this course. If you already know about epidemiology and how to use Excel, then these pre-requisite modules are not required.

Objectives of course

The broader aim of the course is help participants make evidence based animal health policy decisions. This will assist them to improve livestock production and health in Indonesia.

To do this, participants need to be able to access, understand and analyse information on animal health in Indonesia. Fortunately, a new initiative in Indonesia means that Indonesian animal health staff has access to one of the best animal health information systems in the world: iSIKHNAS. This provides staff with large amounts of high quality information (data[1]) that they can use to make good animal health decisions. Therefore, the objectives of this course are to teach participants to download, understand, evaluate, analyse and interpret iSIKHNAS data.

Learning approach

This course will be taught by analysing real iSIKHNAS data. Three case studies will be presented. During each case study, a question will be asked. Then the question will be answered by participants during practical exercises. Spaces are included after each exercise where you can write your answers. Answers to exercises are provided as Appendix 2. It is generally recommended that you answer the question before reading the answers.

Sometimes, notes on core concepts will be presented before or during case studies to support learning. These are backgrounded with grey to enable you to distinguish these notes from the exercises.

The course is very applied and relevant to Indonesian animal health staff. The three case studies concentrate on: assessment of veterinary services (staff performance), disease management (diarrhoea in cattle) and livestock production (beef self-sufficiency).

The data used in this training course was downloaded in early 2014. This was when iSIKHNAS had been operating for approximately a year in a small part of Indonesia. This early data was used so that we could provide answers to exercises. You may wish to download newer and more complete data to analyse during exercises at the time of your course. Please be aware that if you do, you will not have answers to check your work.

The interim nature of the data means that no real conclusions can be made about the results of data-analyses conducted during this course. Instead, we conducted the analyses and made conclusions to demonstrate and teach data analysis. Over time, more complete data will be available. Then Indonesians will be able to conduct more complete and accurate analyses.

The course is delivered in Excel. Excel was chosen because it is cheap, available to most Indonesian staff and intuitive. Analyses in Excel will allow staff to make some useful conclusions about iSIKHNAS data.

If you intend to do a lot of important statistical work, you will need to learn how to use a complete statistical package instead of Excel. For this reason, we have also included some extension work for those participants who wish to extend their knowledge beyond Excel. This is presented in Appendix 1. Here R, a free online statistical package is introduced. R is one of the most useful software packages in the world. Better still it is free and downloadable from the internet. Appendix 1 repeats case study 1 in R.

Many screenshot videos will be used to assist you in understanding how to do exercises during the course. These can be played on several different software platforms including Windows media player.

Facilitator material

Powerpoint presentation for Day 1

Powerpoint presentation for Day 2

Powerpoint presentation for Day 3

Deciding which test to use presentation

Training schedule

Overview of data analysis

Relevance of data analysis to animal health policy

In order to make good animal health policy, a veterinarian needs to understand the animal health situation where they work. For example, how much disease is present? Or, what is causing disease and how are various interventions working?

To gain this understanding a veterinarian could guess at the situation or they could make assumptions based on their own experience. These are generally poor means of making decisions. Decisions made by guessing are made in the absence of information. Decisions made on their own experience can be useful but are generally based on a very small amount of experience. That is, decisions are based on the experience of only one veterinarian, even if that veterinarian is very experienced.

A better means of decision making for veterinarians is to make decisions based on information that reflects the broader animal health situation. This information can be received in several ways, such as in animal health data, publications, text books and reports. Fortunately, animal health information (data) is now being collected across much of Indonesia. This data is recorded in iSIKHNAS. This data can assist good decision making if it is analysed and interpreted appropriately. The broad objective of this course is to assist you to do this.

Whilst you will learn a lot in the next several days, it is important that soon after completing this course you begin to download and analyse your own iSIKHNAS data. This will ensure that you consolidate your learning, improve your skills and at the same time improve your evidence based decision making. So please, set a day aside next week to do some of your own data analyses using the skills you learn here. Then regularly do some analysis of iSIKHNAS data. Over time your skills will improve.

Introduction to the basic steps of data analysis

There are several recognised steps to analyse and interpret data. These steps are determining an objective for your analyses, data management, describing data and testing hypotheses. Each of these will be briefly introduced here. Then the rest of the manual uses the four steps in the case studies.

Objective

It is important to have a clear and concise objective for your analysis. For example, what is the prevalence of diarrhoea in cattle for 2014? This then allows you to be focused in your efforts and to source appropriate data to address the objective. An objective is then translated into a hypothesis and tested.

Data management

It is important that veterinarians know how to access iSIKHNAS data and use it. We will help you to download iSIKHNAS data. We will also help you to preserve, error check, create and evaluate the iSIKHNAS data.

Description of data

The next step is to describe the data. One purpose of this step is to further check data for errors. Another purpose is to understand the structure and nature of data and the relationships between different parts of the data. In this step, single variable summaries, summaries of relationships between variables and plots are used. This helps you to start hypothesis testing (step 4).

Hypothesis testing

In order to comprehensively address an animal health question it is important to develop a testable question (or hypothesis) from your objective. This hypothesis can then be tested using appropriate statistical tests and you can decide whether the data supports your idea.

Autoship1256.png

Notes: Key concepts for Sampling

Case study 1: Performance measures for veterinary services

Introduction to case study

It is important that a senior veterinarian at local level be able to assess the performance of local veterinary services. A key function of local veterinary services is to investigate livestock disease syndromes that may indicate outbreaks of important diseases.

One means of reviewing performance of a local veterinary service is to see how rapidly important livestock disease syndromes are investigated. We will use this example for case study 1.

The iSIKHNAS data we will use records the occurrence of a number of priority syndromes such as:

  • Abortion and swollen joints
  • Biting and behavioural change
  • Fever in pigs
  • Limping, drooling and vesicles
  • Poultry mortality
  • Sudden death

These are designed to allow early detection of important diseases such as rabies, vesicular diseases such as Foot-and-Mouth disease, highly pathogenic avian influenza, brucellosis and anthrax.

It is important to note that the data used in this training course was downloaded in early 2014. This data was collected when iSIKHNAS had been operating for approximately a year in a limited area of Indonesia. The incomplete nature of our data means that no real conclusions can be made about the results of data-analyses conducted in this course. However, over time more complete data will be available for Indonesians to conduct better analyses.

Skills to be developed during this case study

  • Develop an objective for a data analysis
  • Download iSIKHNAS data
  • Preserve original data
  • Evaluating data (understand the structure of a dataset, identify errors, understand bias)
  • Creating new data
  • Description of data (measures of central tendency, measures of spread, plotting)

List of files for Case Study 1

Data:

isikhnas_priority disease syndromes_March_2013.csv

Videos:

Case study 1_data download video

Case study 1_exercise 3_missing data video

Case study 1_exercise 5_creating new data video

Case study 1_exercise 6_descriptive statistics video

Case study 1_exercise 6b_descriptive statistics without tool pack video

Case study 1_exercise 7_plotting histograms video

Steps in analysis of Case study 1 (exercises)

We will work though the steps of data analysis for this case study.

Step 1: Objective

Before working with the data it is important to have an idea of the question we wish to answer. We call this an objective. Here our general objective is to assess the veterinary services for the speed at which they investigate reports of priority syndromes. We will refine this general objective during an exercise soon (exercise 2).

Open the iSIKHNAS data (isikhnas_priority disease syndromes_March_2013.csv) and examine it. Figure 3 is a screen shot of the priority syndromes worksheet and similar to what you should see. Exercise 1 concentrates on understanding the data. Exercise 2 concentrates on developing a more useful objective. Complete exercise 1 and then exercise 2.

Exercise 1: Question (examine the data)

Break into groups of 2-4 people and conduct the following exercises. Place your answers in the box below this one.

Carefully examine the worksheet columns and the data contained within each of the columns.

What does each of the columns mean?

How many priority syndrome reports are there in the worksheet?

Think about the objective of the analyses we wish to conduct.

Which data columns do you think are particularly useful to address our objective? As a hint, two important columns are highlighted in Figure 3.

Report your findings back to the larger group.


Exercise 1: Answer (examine the data)

What do each of the columns mean?



How many priority syndrome reports are there in the worksheet?



Which data columns do you think are particularly useful to address our objective? As a hint, two important columns are highlighted in Figure 2.



Answers are provided in Appendix 2. Please write your own answers before checking the answers in Appendix 2. This will assist your learning.

Now that you have examined the data and understand it, we need to develop a more specific and useful objective for our analyses. This is the purpose of Exercise 2.

Exercise 2: Question (outline a detailed objective)

Determine a more detailed analysis objective. Hint, what would the difference between the two highlighted date columns in Figure 2 show? Use this to develop a specific objective for the analysis.

Report back to the group with your refined objective.


Exercise 2: Answer (a specific objective)

Write what you think a more specific objective of the analyses should be.




Figure 3: The Priority syndromes worksheet of the iSIKHNAS data download.

Figure 3 The Priority syndromes.svg

Step 2: Data management

In this data management section we will:

  • Download iSIKHNAS data
  • Preserve data
  • Evaluate data
  • Create new data.
Downloading iSIKHNAS data

This is an important unstructured exercise as it will help you to know how to access iSIKHNAS data for your future work. See the video - Case study 1_data download.avi - for a demonstration of downloading data from iSIKHNAS.

The following steps are how to download iSIKHNAS data.

  1. Go to the website (www.isikhnas.com)
  2. Login with your username (cell number or email) and password
  3. Choose the data set (reports, disease, priority syndrome, then priority disease reports)
  4. Choose the date range you are interested in
  5. Choose the geographic area whose data you wish to examine
  6. Choose run report
  7. Scroll down and click view download and a csv sheet will download
  8. Save this and name it appropriately.

Before continuing the course, pause and practice downloading iSIKHNAS data of relevance to you.

We have downloaded the data you will need for the exercises in this course so we can provide you with exercise answers.

Preserving data

Whenever you download a data file you should save two separate versions of the same file.

One copy should be kept unchanged as an original copy of the data file. You will always have an original and unchanged copy of the data file as it was at the time you downloaded it.

The other copy can then be used as the working file for data analysis. During the analysis process you may make a variety of changes to the data including removing or adding data and variables.

A very good way to do this is to download the original data into a dedicated original data file (for example "Original data" folder in "My Documents"). An alternative practice is to save the original copy with the word _ORIGINAL added to the file name so you can easily identify it. Never work on original data files.

Give each iSIKHNAS download a meaningful name using an appropriate naming convention so you can tell downloads apart. Here we have named our data isikhnas_priority disease syndromes_March_2013.csv.

Having two separate files (working and original) means you can always go back to the original file at any time to start a new analysis or to check what changes have been made in the working copy. If you do go back to the original file to start another analysis, make another working copy of the original file to work on.

See Figure 4 where we have included a screenshot of a possible folder structure with a folder for original data which includes the downloaded Excel file and a separate folder for analysing the data that includes a separate copy of the downloaded data.

Figure 4: Possible folder structure to ensure that original downloaded data is preserved and that analyses occurs in a separate folder with copied data.

Figure 4 Possible folder structure.svg


Autoship1256.png

Notes: Key concepts about data errors

Evaluating data

Identifying errors and other possible problems in the data

Exercise 3: Question (count missing data entries using Excel)

Count the number of missing data points in the Tanggal diinvestigasi column of the isikhnas_priority disease syndromes_March_2013.csv. Hint: use the filter function in Excel to select only the blank cells (missing data), then count them, or use the countif function in Excel.

Hint: Watch this video for assistance.


Exercise 3: Answer (count missing data entries using Excel)

How many missing values did you count with the filter function?



How many missing values did you count with the "countif" formula?



Exercise 4: Question (selection and information bias).

In your groups, discuss what sort of selection and information bias may be present in the iSIKHNAS data.

Hints: Can you think of any syndromic investigations that may be routinely missing from the dataset? What effect would this have on conclusions based on the data that you do have? This is selection bias.

Can you think of any frequent errors that may occur when you are investigating a syndromic event and making a provisional diagnosis? This is information bias.

Exercise 4: Answer (selection and information bias).

Record some notes from the group discussion on how selection and information bias may occur.


Creating new data It is important to be able to create new columns of data (or variables) using existing columns. This will assist you in doing more detailed analyses of iSIKHNAS data. For example, you will need to do this if you want to determine the time between the report date and the investigation date. In the following exercise you will need to create a new column "jam untuk menyiasat" and populate it with the number hours between reporting and investigation.

Exercise 5: Question (creating new data: hours to investigation)

Create a new column that measure the time between reporting and investigation.

Steps (hint):

Change the format of the columns Tanggal laporan and Tanggal diinvestigasi in the isikhnas_priority disease syndromes_March_2013.csv worksheet to date format (if this is required).

Create a new column heading called time to investigate (masa untuk menyiasat).

Create a formula in the column that subtracts the Tanggal laporan date from the Tanggal diinvestigasi to determine the number of days between dates.

Multiply by 24 to determine the number of hours between reporting and investigation (jam untuk menyiasat).

Convert the answer to a new column of values using paste special (values).

Delete the extra columns, just leaving the jam untuk menyiasat column.

Delete the #NUM! error message using the filter function. These are missing values.

You have now created new data that you can use to investigate the time taken to investigate priority syndromes that allows you to assess the performance of the veterinary services.

Hint: Watch a video for a demonstration

Exercise 5: Answer (creating new data: hours to investigation)

Record in brief note form what you found difficult or easy about exercise 5.

This concludes the step 2: data management. We have considered selection and information bias, we have examined errors and we have created new data. Next, we will describe the data.

Step 3: Description of data

The next stage in analysing this data is to examine the jam untuk menyiasat column and describe it. To do this we will first describe the jam untuk menyiasat column with descriptive statistics and then we will plot the data on a graph. However, before we do this we will provide some notes on descriptive statistics.


Autoship1256.png

Notes: Key concepts for descriptive statistics



Describing a single variable

Exercise 6: Question (calculating measures of central tendency and dispersion of jam untuk menyiasat)

Above we have used 5 simulated values to demonstrate calculations of measures of central tendency and dispersion. Now we will calculate these same measures using real data from iSIKHNAS.

Calculate the mean, median, standard deviation, confidence interval, range and interquartile range for the variable jam untuk menyiasat in the Syndromas prioritas worksheet.

Hint:

Use Excel. Especially use the descriptive analyses tool in Excel.

Hint: See video for a demonstration. This video describes how to conduct the analysis using the Analysis ToolPak in Excel.

If you do not have access to the Analysis ToolPak then you may wish to view the following video on how to calculate the values using individual Excel formulas Watch video now.

Exercise 6: Answer (calculating measures of central tendency and dispersion of jam untuk menyiasat)

Write your answer to each of the parameters in the following table:

Exercise 6 Answer table.svg
Plotting

Now we have finished calculating the measures of central tendency and dispersion on jam untuk menyiasat.

There is an indication that the data is not normal. This is indicated because the mean (750 hours) and median (0 hours) are very different. It is important to look at the distribution of the data in order to be able to choose the most appropriate estimates. For example, will we pay attention to the mean or the median in this case?

To do this we will create a histogram of jam untuk menyiasat. This will give us an idea of the distribution of the data. We will do this in Excel in a moment as an exercise, but prior to this we will present the histogram and discuss it. Here is the histogram of jam untuk menyiasat (Figure 4).

Figure 5: A histogram of hours to investigate a priority syndrome disease report.

Figure 5 A histogram of hours to investigate a priority syndrome report.png

By examining the histogram we can see that the jam untuk menyiasat:

  • Ranges from 0 hours to greater than 3261.6 hours.
  • There are 43 priority syndromes where investigation was immediate (0 hours). This indicates that a lot of jam untuk menyiasat times were small. This indicates a good veterinary service.
  • There is a long tail where the time to investigate is long for some reports of a priority disease.
  • The distribution is not normal. The median and the interquartile range are much better descriptive measures than the mean and standard deviation or confidence interval.

Attention is required to see why some times to investigate are very long. These generally occurred in 2013 when iSIKHNAS was new and may indicate an early data recording problem or that investigation procedures have improved since the recording began. However, further efforts to determine the reasons for long investigation times are required.

Exercise 7: Question (histogram of jam untuk menyiasat in Excel)

Create a histogram of jam untuk menyiasat.

Hint: Watch the video for assistance.

Exercise 7: Answer (histogram)

Place the histogram you produced for the hours to investigate in the space below. You may draw it free hand, cut and paste it from the Excel worksheet into this document electronically or print it and physically paste it in to the space.








Step 4: Statistical hypothesis testing

The fourth and final step in data analyses is to conduct hypothesis testing. However, this is not relevant to this case study where we simply wished to examine the time taken to investigate a priority syndrome report. We will conduct hypothesis testing in the next two case studies.

Summary of case study 1

In this case study you first learnt how to download iSIKHNAS data.

You conducted several important steps in data analysis:

  1. Established an objective.
  2. Managed data (by backing up your original data, examining the data for errors and creating new data).
  3. Described data.

These three common steps should always be undertaken no matter what data you have.

It is important to note that sometimes the way you describe data (step 3) will differ. For example here you used measures of central tendency (e.g. a median) and measure of dispersion (Q1:Q3) to describe the data. This sort of description is only useful for a continuous variable. Other types of data will need different approaches to describe them (see Case study 2 where categorical data is described with contingency tables). Regardless of the type of data, you should always try to describe the data by plotting it on a graph.

In this case study it was not necessary to conduct the 4th step (hypothesis testing) as we simply wished to explore the time it took to investigate a priority disease report.

This ends case study 1.

Case study 2: Seasonal prevalence of diarrhoea (Mencret) in cattle

Introduction to case study

Diarrhoea is a clinical sign of many serious production diseases, for example internal parasites. Here we will examine the prevalence of diarrhoea in cattle throughout the year to determine if there is a seasonal peak in the amount of diarrhoea. This will allow you to plan for Mencret. For example we can identify if and when an annual peak of cases is observed each year. This may allow you to request anthelmintic supplies before the peak or implement a preventative plan in advance of the seasonal peak. The relevant data is ‘Cattle clinical signs.xls’.

It is important to note that the data used in this training course was downloaded in early 2014. This data was collected when iSIKHNAS had been operating for approximately a year in a limited area of Indonesia. The incomplete nature of our data means that no real conclusions can be made about the results of data-analyses conducted in this course. However, over time more complete data will be available for Indonesians to conduct better analyses.

Skills to be developed during this case study

  • Evaluation of data
  • Creation of new data
  • Describe data using contingency tables and plots
  • Hypothesis testing using odds ratios and confidence intervals

List of files for case study

Data

Cattle clinical signs.xlsx

Videos

Case study 2_exercise 10 video

Case study 2_exercise 11 video

Case study 2_exercise 12 video

Case study 2_exercise 13 video

Case study 2_exercise 14 video

Steps in analysis of Case study 2 (exercises)

Step 1: Objective

The objective in this case study is to determine whether there is a difference in prevalence of cattle with diarrhoea between the dry and wet season. We need this information so that we can potentially improve veterinary management.

For example we may wish to order medicines in time for a seasonal peak in diarrhoea. We may also wish to implement management programs at the correct time of year to reduce diarrhoea. To do this we have to know if there is a seasonal peak and if there is, when this occurs.

There are no exercises on objectives in this case study.

Step 2: Data management

The relevant dataset is Cattle clinical signs.xlsx.

The first step is to preserve your data. Make a copy to work on and archive the original file.

Next we will evaluate the data. We will decide which data is most relevant and we will check for problems in the data (errors and bias). We will then decide whether we need to create additional data to address the objective.

What data is important?

There are many columns of data (variables). We will start by thinking about which data columns are important in the Cattle clinical signs.csv. There are two sorts of data that we need:

  1. Some way of calculating the amount of disease. This implies we need both numerator data (the number of cases of diarrhoea cattle) and denominator data (the number of cattle at risk of disease)
  2. Change of disease over time. We need an idea of when observations of sick cattle occurred.

Exercise 8: Question (Determine which variables are important to address our objectives in the Cattle clinical signs.xlsx).

Hint:

Examine columns that represent the Tanda (signs) being reported and Diagnosa sementara (provisional diagnosis). Which do you think would be most useful to indicate the possible diagnosis of cattle with diarrhoea? Is there data that indicates when infection occurred? What denominator data is present?

Exercise 8: Answer (Determine which variables are important to address our objectives in the Cattle clinical signs.xlsx).

Write notes based on group discussions on which data columns are most important and why?

After you have discussed the issues in your groups, refer to the answers provided to assist your answering of this question. Write the answer here.


Identifying errors and other possible problems in the data

It is important to again determine whether there is any missing data and to think about whether there are any biases (selection or information) or errors in the data. We will do this as exercises as we did in in the first case study.

Exercise 9: Question (examine the data for errors and think about biases that may be present)

Hint:

Use the filter function in Excel to select only the blank cells (missing data), then count them, or use the countif function in Excel. Do this for the two variables of interest.

Whilst using the filter function, examine all the categories in the two columns of interest that were outlined in the exercise above. Are any categories unusual or potentially errors?

What would you suggest we do about the errors (if any were detected)?

Are there any biases likely in the data?

Exercise 9: Answer (examine the data for errors and think about biases that may be present)

Did you detect errors (yes/no)?

Write notes on group discussions of any biases present.



Creating new data to facilitate later analyses

In order to conduct analyses we will have to create some additional data. Do exercise 10 to create this data.

Exercise 10: Question (create new data)

We need to create some new data in order to analyse our data optimally. Think about which data is required.

Hint: We need to know whether the season is the wet or dry season and whether the clinical signs recorded are diarrhoea or other clinical signs. This will require the use of formulas to populate three new columns (variables), month of year, season and diarrhoea/other clinical sign. How would you do this in Excel and where would the information come from in the worksheet?

After you have discussed this, see the answers document for the general approach (and formulas) and watch the video on exercise 10.

Watch the video now.

Exercise 10: Answer (create new data)

Write notes on the video exercise if you would like to.


Step 3: Description of data

This stage is important to further check for errors and to begin to get an understanding of relationships between variables. This understanding will help us to answer our question of interest.

Descriptive of data

We will first examine and summarise each of the columns of interest. To do this will require Pivot tables in Excel. This will be done as an exercise.

Exercise 11: Question (summarise each of the columns of interest (month, season and diarrhoea?))

Establish 3 contingency tables, one for count of clinical signs (diarrhoea or otherwise, one for count of season (wet or dry) and another for count of month of the year.

Hint:

Each of the important data columns contains very simple data, either one of 12 months for the month integer column or wet/dry for Season and Diarrhoea/Other for the Clinical sign column. The previously used data analysis tool which calculated measures of central tendency and dispersion is not really relevant.

In this instance the pivot table option will allow you to summarise each of the categories. This will give you the number of observations in each of the categories (e.g. wet versus dry season).

Watch the video now.

Exercise 11: Answer (Summarise each of the columns of interest (month, season and diarrhoea?))

Insert your pivot table results into the section below.


Relationships between variables

Contingency tables

The next step is to begin to understand the relationships between different variables (e.g. clinical signs by season). This is done prior to formal hypothesis testing. In this instance the data we have is categorical (wet/dry season or month and diarrhoea/other). This suggests that contingency tables will again be very useful. Contingency tables subtotal categories of data and are ideal to calculate measures of association (later).

Exercise 12: Question (contingency tables)

Establish two contingency tables (pivot tables) in Excel and examine the data.

Hint:

The contingency tables should look as follows.

Exercise 12 Diarrhoea by season and month.svg

Hint:

Use the demonstration video to complete the exercise and populate the contingency tables.

Watch the video now.

Does there appear to be a relationship between season or month and diarrhoea?


Exercise 12: Answer (contingency tables)

Paste your two contingency tables below and examine the relationship between time of year and diarrhoea.



Completed contingency tables for discussion.

Below we have completed the contingency tables.

It appears that there may be a slight increase in the proportion of sick cattle that have diarrhoea in the wet season (27 verse 22%). It appears also that the prevalence of sick cattle with diarrhoea may increase during the year, with the middle dry season months having relatively low prevalence of diarrhoea.

However, results are very variable. Later we will use statistical hypothesis testing to allow more rigorous conclusions.

Exercise 12 Diarrhoea by season and month 2.svg

Plotting

It is very important to visualise the data in the form of plots since plots give a lot of information to the viewer.

This is important for the contingency table of month against diarrhoea. We will examine this data using scatterplots and lines of best fit in Excel. A scatterplot is used to plot data that has two variables associated with every observation. Usually, ordinal or continuous variables are required to plot data in a scatter plot.

Here, the prevalence of diarrhoea in all sick cattle presented for a month is one continuous variable (i.e. a percentage that varies from 0-100). We can plot months as an ordinal variable. This means we have an ordinal and continuous variable and we can therefore create a scatterplot with month on one axis (x axis) and prevalence of diarrhoea in sick cattle as another variable (y axis). This will allow us to examine the data and see visually whether the prevalence of diarrhoea cattle changes throughout the year.

Exercise 13: Question (plots of mencret prevalence by month)

Create a scatterplot of month against prevalence of diarrhoea in all sick cattle.

Hint:

Use the contingency table of month against Mencret above to populate the plot. Watch the demonstration video before attempting this.

Watch the video now.


Exercise 13: Answer (plots of mencret prevalence by month)

Include your scatterplot of month against prevalence of diarrhoea in all sick cattle below.




We have included the completed plot below (Figure 6). Some points to make on scatterplot design. The independent variable goes on the horizontal axis and the dependant variable on the vertical axis. The independent variable (month) is assumed to affect the dependant variable (prevalence). We need to include axis titles. We have included a line of best fit which here is a moving average. This plot helps by giving us more information about the relationship between season and prevalence of diarrhoea cattle. The line of best fit goes down in the dry season indicating there may be a relationship between diarrhoea and month of the year. The lowest prevalence appears to be in the dry season around months 4-8 with increasing prevalence in the wet season.


Figure 6

Figure 6.svg


Autoship1256.png

Notes: Key concepts of Measures of Disease



Step 4: Statistical hypothesis testing

Measure of association and hypothesis testing

There are two common ways of testing hypotheses using standard methods in veterinary epidemiology. This is with measures of association or statistical hypothesis testing. Here we will use a measure of association to answer our question. We will use a more traditional statistical hypothesis testing approach in the next case study (case study 3). In case study 3 we will also provide a comprehensive set of background knowledge notes to introduce the principals of statistical hypothesis testing.

In this case study we will establish a hypothesis, calculate a measure of association (and confidence interval) and interpret the results.

The question of interest here is whether there is a difference in diarrhoea prevalence between the wet or dry season. It is always important to first establish the null hypothesis. The null hypothesis is always the hypothesis of no effect. The null hypothesis is rejected or retained based on the value of the measure of association (and associated confidence interval).

Null hypothesis= Ho = There is no difference between the odds of diarrhoea between the wet and dry season.

The alternative hypothesis is generally opposite the null hypothesis. The alternative hypothesis is tentatively accepted if you have enough statistical evidence to reject the null hypothesis. Here the alternative hypothesis is:

HA = There is a difference between the odds of diarrhoea between the wet and dry season.

Exercise 14: Measure of association between Mencret and season.

Calculate a measure of association and confidence interval between diarrhoea and season.

Hint:

Calculate an odds ratio using the data from the two way contingency table of season and diarrhoea from Exercise 12.

We will be using Epitools for this exercise (http://epitools.ausvet.com.au/content.php?page=home). Please browse the Epitools website. Also, watch the video which details Epitools and how to use Epitools to calculate measures of association.

Watch the video now


Exercise 14: Measure of association between Mencret and season.

Place your epitools results here.



Here we will calculate the odds ratio by hand to assist you in interpreting the exercise. Here is the contingency table.


Dry Season
Wet Season
Diarrhoea
246
544
Other
875
1437
Total cattle
1121
1981
Prevalence of Mencret in sick cattle (%)
22
27

Odds of diarrhoea in the wet season:

544/1437 = 0.3786

Odds of diarrhoea in dry season:

246/875 =0.2811

Odds ratio:

0.3786/0.2811 = 1.35

We also know the 95% confidence interval is the interval 1.13 - 1.60. This means there is an association between season and diarrhoea. We thus reject the null hypothesis and infer that the alternative hypothesis is supported by the data: namely that season does have a role in the proportion of diarrhoea in passively surveyed cattle. However, it is important to note that this is simply a statistical association between the wet season and diarrhoea. It does not prove that the wet season causes diarrhoea. Instead it indicates that this may be possible. Further investigation and study would be required to determine that any factor actually causes disease.

Summary of case study 2

In this case study you again conducted several important steps in data analysis:

  1. An objective was presented to you
  2. You managed data (by backing up your original data, examining the data for errors and creating new data)
  3. You described data
  4. You conducted hypothesis testing using odds ratios and confidence intervals.

This time you described data by establishing contingency tables (called pivot tables in Excel) and by creating scatterplots and lines of best fit. These steps were almost enough to answer your research question of whether there was a change in diarrhoea between the wet and dry season. It appeared that there was a drop in diarrhoea in the dry season on the plot. We conducted some hypothesis testing using a measure of association and confidence interval to account for the effect of sampling variability so we could be sure what we saw in the tables and plots was most likely a real effect.


This is the end of case study 2.

Case study 3: Beef self-sufficiency (based on slaughter statistics)

Introduction to case study

The number, type and date of livestock slaughtered at abattoirs in Indonesia are recorded in iSIKHNAS. We will examine the cattle records.

We will determine whether there are changes in the number of different classes (e.g. female productive versus Australian imported cattle) of cattle slaughtered over the collection period of iSIKHNAS. We can make tentative conclusions about Indonesia’s quest for beef self-sufficiency based on changes in the slaughter of different categories of cattle.

It is important to note that the data used in this training course was downloaded in early 2014. This data was collected when iSIKHNAS had been operating for approximately a year in a limited area of Indonesia. The incomplete nature of our data means that no real conclusions can be made about the results of data-analyses conducted in this course. However, over time more complete data will be available for Indonesians to conduct better analyses.

Skills to be developed during this case study

  • Data management
  • Summarise data with plots
  • Understanding of null and alternative hypotheses
  • Hypothesis testing using a statistical test

List of files for case study

Data

Abattoir_April_2014.xlsx

Videos

Case 3 Exercise 16 video

Case 3 Exercise 17 video

Case 3 Exercise 18 video


Autoship1256.png

Notes: Key concepts for hypothesis testing and Case Study 3

Steps in analysis of Case study 3 (exercises)

Step 1: Objective

Exercise 15:

Break into small groups and discuss the objective of the case study. Do the hypotheses support the objective of the analyses?

Report back to the group with your conclusions.

Hint:

The broader focus of the analysis is to address an issue of concern for Indonesia, namely achievement of self-sufficiency in beef production. One way to understand whether self-sufficiency is increasing is to examine slaughter statistics.

Hypotheses are:

Null hypothesis: Month and slaughter categories are independent. That is, there is no effect of month on total slaughter numbers in different categories of slaughtered cattle.

Alternative hypothesis = Month and slaughter categories are dependant. That is, there is an effect of month on total slaughter numbers in different categories of slaughtered cattle.

Are these hypotheses adequate?

Exercise 15: Do the hypotheses support the objective of the analyses?

Write brief notes on group discussions of this topic.

Step 2: Data management

Back up your data before you alter it!

Exercise 16: Question (create a pivot table of cattle slaughter statistics by month).

Hint:

Select all the cattle data from the Excel worksheet titled Abattoir 2014.xls and save in a fresh work sheet. Open cattle slaughter data, error check and create a contingency table (pivot table in Excel).

Watch video now. Note you have already done some of this in previous exercises.

Exercise 16: Answer (create a pivot table of cattle slaughter statistics by month).

Include the pivot table you created below.


Step 3: Description of data

Exercise 17: Question (summary plots)

Do some summary plots for the contingency table. The aim is to understand the data before hypothesis testing.

Hint: Watch the video for assistance.


Exercise 17: Answer (summary plots)

Include the summary plots you created below.

Step 4: Statistical hypothesis testing

Exercise 18: Question (chi-squared analysis of slaughter data)

Conduct a chi-squared statistical test on the pivot table.

Hint: Watch this video now.


Exercise 18: Answer (chi-squared analysis of slaughter data)

Include your chi-squared test results below.


Exercise 19: Question (inference)

Break into your groups and discuss the findings of the significant chi-squared test. What does it mean? That is, make some inferences.

Hint:

Look at the result of the statistical test. Look at the summary plots from exercise 17 and look at the expected verse observed values from exercise 18. Where are the major differences between expected and observed values that are most contributing to the size of the chi-squared statistic? These are the areas where the effect is largest.


Exercise 19: Answer (inference)

Write brief notes on group discussions on inference about the significant chi-squared test result for the slaughter data.


Summary of case study 3

In this case study you were first presented with information on the basic statistical hypothesis testing approach. This can be complex, but is a consistent approach used in many studies. Then you again conducted several important steps in data analysis:

  1. The hypotheses were checked to see that they supported the objective
  2. Managed data (by backing up your original data, examining the data for errors and creating new data)
  3. Described data with contingency tables and plots
  4. Conducted hypothesis testing

This time you conducted hypothesis testing with a formal statistical approach using a chi-squared test. The major steps in the hypothesis testing approach can be used for most datasets. You now have the general skills to do this. With time and experience you will improve your knowledge of which statistical test to use and how to apply them.

This ends case study 3.

Concluding remarks

Data analysis is critical to good animal health management and policy formation. Indonesia is fortunate to have an excellent and newly functional animal health information system, iSIKHNAS. However, it is no use having one of the world's best information systems if no-one uses the data. Hence, this course has focused on helping you to begin to use iSIKHNAs data. These approaches may help you to improve animal health decision making.

There are several standard steps to data analysis, including developing a research question of interest (objective), data management, description and hypothesis testing. All of these steps were used in this course. In order to consolidate your new skills it is recommended that you apply these steps to questions of interest to you immediately upon your return to your workplace. For example if you spend a day or two per month for the next several months you will consolidate your new skills.

We have only had time to use a fraction of the available statistical approaches (e.g. measures of association and a chi-squared test). You may choose to expand your knowledge beyond what we have learnt in this course. There are a number of useful text books that can assist you in developing your statistical skills. Veterinary Epidemiologic Research (Dohoo et al., 2009) is a useful text book that covers veterinary epidemiology and presents some statistics in this field. Statistics for Veterinary and Animal Science (Petrie and Watson, 2006) is a useful text book that applies statistics more generally to veterinary science. If you decide that you really want to learn both R and expand your statistical knowledge you should complete the appendix on R. Introductory Statistics with R (Dalgaard, 2008) is a very useful text book that introduces both statistics and R. If you start using R, then the statistical world will be at your figure tips!

Appendix 1: Extension work using R.

Themes256.png

This appendix introduces the reader to R. It is for those course participants who do a substantial amount of statistical analyses (or aim to) and wish to improve their capability beyond what Excel allows. There may not be time to complete this appendix during the course, but it can be completed later.

R is a statistical and graphics software environment. It is widely recognised and used throughout the world. R provides a very powerful and flexible environment in which to conduct simple or complex statistics or to produce publication ready graphics. It is also free, and constantly updated. It is widely supported by a range of scientists who are constantly contributing new packages that expand the capability of R. R will be around for the long term, so any effort you invest in learning the software will be useful throughout your career. As there are so many avid users around the world there are many online help sites and discussion boards with free code and advice.

Read more...

References

Dalgaard, P., 2008. Introductory Statistics with R. Springer.

Dohoo, I., Martin, W., Stryhn, H., 2009. Veterinary Epidemiologic Research. VER Charlottetown.

Petrie, A., Watson, P., 2006. Statistics for Veterinary and Animal Science. Wiley.




  1. The Oxford dictionary definition of data is: ‘Facts and statistics collected together for reference or analysis’