Epidemiological Data Analysis: Perbedaan revisi

(Appendix 2: Answers to exercises in data training manual)
(Step 1: Objective)
Baris 177: Baris 177:
 
'''Figure 3: The Priority syndromes worksheet of the iSIKHNAS data download.'''
 
'''Figure 3: The Priority syndromes worksheet of the iSIKHNAS data download.'''
  
[[Image:Figure 3 The Priority syndromes.svg]]}}
+
[[Image:Figure 3 The Priority syndromes.svg|500px|center]]}}
  
 
==== Step 2: Data management  ====
 
==== Step 2: Data management  ====

Revisi per 25 Juli 2014 14.52

Daftar isi

Data analysis using iSIKHNAS data case studies

Introduction

Pre-requisites

This course is for veterinarians within the Indonesian animal health system. It is assumed that participants will have completed the Excel, Basic Field Epidemiology and Surveillance training modules before this course. If you already know about epidemiology and how to use Excel, then these pre-requisite modules are not required.

Objectives of course

The broader aim of the course is help participants make evidence based animal health policy decisions. This will assist them to improve livestock production and health in Indonesia.

To do this, participants need to be able to access, understand and analyse information on animal health in Indonesia. Fortunately, a new initiative in Indonesia means that Indonesian animal health staff has access to one of the best animal health information systems in the world: iSIKHNAS. This provides staff with large amounts of high quality information (data[1]) that they can use to make good animal health decisions. Therefore, the objectives of this course are to teach participants to download, understand, evaluate, analyse and interpret iSIKHNAS data.

Learning approach

This course will be taught by analysing real iSIKHNAS data. Three case studies will be presented. During each case study, a question will be asked. Then the question will be answered by participants during practical exercises. Spaces are included after each exercise where you can write your answers. Answers to exercises are provided as Appendix 2. It is generally recommended that you answer the question before reading the answers.

Sometimes, notes on core concepts will be presented before or during case studies to support learning. These are backgrounded with grey to enable you to distinguish these notes from the exercises.

The course is very applied and relevant to Indonesian animal health staff. The three case studies concentrate on: assessment of veterinary services (staff performance), disease management (diarrhoea in cattle) and livestock production (beef self-sufficiency).

The data used in this training course was downloaded in early 2014. This was when iSIKHNAS had been operating for approximately a year in a small part of Indonesia. This early data was used so that we could provide answers to exercises. You may wish to download newer and more complete data to analyse during exercises at the time of your course. Please be aware that if you do, you will not have answers to check your work.

The interim nature of the data means that no real conclusions can be made about the results of data-analyses conducted during this course. Instead, we conducted the analyses and made conclusions to demonstrate and teach data analysis. Over time, more complete data will be available. Then Indonesians will be able to conduct more complete and accurate analyses.

The course is delivered in Excel. Excel was chosen because it is cheap, available to most Indonesian staff and intuitive. Analyses in Excel will allow staff to make some useful conclusions about iSIKHNAS data.

If you intend to do a lot of important statistical work, you will need to learn how to use a complete statistical package instead of Excel. For this reason, we have also included some extension work for those participants who wish to extend their knowledge beyond Excel. This is presented in Appendix 1. Here R, a free online statistical package is introduced. R is one of the most useful software packages in the world. Better still it is free and downloadable from the internet. Appendix 1 repeats case study 1 in R.

Many screenshot videos will be used to assist you in understanding how to do exercises during the course. These can be played on several different software platforms including Windows media player.

Facilitator material

Powerpoint presentation for Day 1

Powerpoint presentation for Day 2

Powerpoint presentation for Day 3

Deciding which test to use presentation

Training schedule

Overview of data analysis

Relevance of data analysis to animal health policy

In order to make good animal health policy, a veterinarian needs to understand the animal health situation where they work. For example, how much disease is present? Or, what is causing disease and how are various interventions working?

To gain this understanding a veterinarian could guess at the situation or they could make assumptions based on their own experience. These are generally poor means of making decisions. Decisions made by guessing are made in the absence of information. Decisions made on their own experience can be useful but are generally based on a very small amount of experience. That is, decisions are based on the experience of only one veterinarian, even if that veterinarian is very experienced.

A better means of decision making for veterinarians is to make decisions based on information that reflects the broader animal health situation. This information can be received in several ways, such as in animal health data, publications, text books and reports. Fortunately, animal health information (data) is now being collected across much of Indonesia. This data is recorded in iSIKHNAS. This data can assist good decision making if it is analysed and interpreted appropriately. The broad objective of this course is to assist you to do this.

Whilst you will learn a lot in the next several days, it is important that soon after completing this course you begin to download and analyse your own iSIKHNAS data. This will ensure that you consolidate your learning, improve your skills and at the same time improve your evidence based decision making. So please, set a day aside next week to do some of your own data analyses using the skills you learn here. Then regularly do some analysis of iSIKHNAS data. Over time your skills will improve.

Introduction to the basic steps of data analysis

There are several recognised steps to analyse and interpret data. These steps are determining an objective for your analyses, data management, describing data and testing hypotheses. Each of these will be briefly introduced here. Then the rest of the manual uses the four steps in the case studies.

Objective

It is important to have a clear and concise objective for your analysis. For example, what is the prevalence of diarrhoea in cattle for 2014? This then allows you to be focused in your efforts and to source appropriate data to address the objective. An objective is then translated into a hypothesis and tested.

Data management

It is important that veterinarians know how to access iSIKHNAS data and use it. We will help you to download iSIKHNAS data. We will also help you to preserve, error check, create and evaluate the iSIKHNAS data.

Description of data

The next step is to describe the data. One purpose of this step is to further check data for errors. Another purpose is to understand the structure and nature of data and the relationships between different parts of the data. In this step, single variable summaries, summaries of relationships between variables and plots are used. This helps you to start hypothesis testing (step 4).

Hypothesis testing

In order to comprehensively address an animal health question it is important to develop a testable question (or hypothesis) from your objective. This hypothesis can then be tested using appropriate statistical tests and you can decide whether the data supports your idea.

Autoship1256.png

Notes: Key concepts for Sampling

Case study 1: Performance measures for veterinary services

Introduction to case study

It is important that a senior veterinarian at local level be able to assess the performance of local veterinary services. A key function of local veterinary services is to investigate livestock disease syndromes that may indicate outbreaks of important diseases.

One means of reviewing performance of a local veterinary service is to see how rapidly important livestock disease syndromes are investigated. We will use this example for case study 1.

The iSIKHNAS data we will use records the occurrence of a number of priority syndromes such as:

  • Abortion and swollen joints
  • Biting and behavioural change
  • Fever in pigs
  • Limping, drooling and vesicles
  • Poultry mortality
  • Sudden death

These are designed to allow early detection of important diseases such as rabies, vesicular diseases such as Foot-and-Mouth disease, highly pathogenic avian influenza, brucellosis and anthrax.

It is important to note that the data used in this training course was downloaded in early 2014. This data was collected when iSIKHNAS had been operating for approximately a year in a limited area of Indonesia. The incomplete nature of our data means that no real conclusions can be made about the results of data-analyses conducted in this course. However, over time more complete data will be available for Indonesians to conduct better analyses.

Skills to be developed during this case study

  • Develop an objective for a data analysis
  • Download iSIKHNAS data
  • Preserve original data
  • Evaluating data (understand the structure of a dataset, identify errors, understand bias)
  • Creating new data
  • Description of data (measures of central tendency, measures of spread, plotting)

List of files for Case Study 1

Data:

isikhnas_priority disease syndromes_March_2013.csv

Videos:

Case study 1_data download video

Case study 1_exercise 3_missing data video

Case study 1_exercise 5_creating new data video

Case study 1_exercise 6_descriptive statistics video

Case study 1_exercise 6b_descriptive statistics without tool pack video

Case study 1_exercise 7_plotting histograms video

Steps in analysis of Case study 1 (exercises)

We will work though the steps of data analysis for this case study.

Step 1: Objective

Before working with the data it is important to have an idea of the question we wish to answer. We call this an objective. Here our general objective is to assess the veterinary services for the speed at which they investigate reports of priority syndromes. We will refine this general objective during an exercise soon (exercise 2).

Open the iSIKHNAS data (isikhnas_priority disease syndromes_March_2013.csv) and examine it. Figure 3 is a screen shot of the priority syndromes worksheet and similar to what you should see. Exercise 1 concentrates on understanding the data. Exercise 2 concentrates on developing a more useful objective. Complete exercise 1 and then exercise 2.

Exercise 1: Question (examine the data)

Break into groups of 2-4 people and conduct the following exercises. Place your answers in the box below this one.

Carefully examine the worksheet columns and the data contained within each of the columns.

What does each of the columns mean?

How many priority syndrome reports are there in the worksheet?

Think about the objective of the analyses we wish to conduct.

Which data columns do you think are particularly useful to address our objective? As a hint, two important columns are highlighted in Figure 3.

Report your findings back to the larger group.


Exercise 1: Answer (examine the data)

What do each of the columns mean?



How many priority syndrome reports are there in the worksheet?



Which data columns do you think are particularly useful to address our objective? As a hint, two important columns are highlighted in Figure 2.



Answers are provided in Appendix 2. Please write your own answers before checking the answers in Appendix 2. This will assist your learning.

Now that you have examined the data and understand it, we need to develop a more specific and useful objective for our analyses. This is the purpose of Exercise 2.

Exercise 2: Question (outline a detailed objective)

Determine a more detailed analysis objective. Hint, what would the difference between the two highlighted date columns in Figure 2 show? Use this to develop a specific objective for the analysis.

Report back to the group with your refined objective.


Exercise 2: Answer (a specific objective)

Write what you think a more specific objective of the analyses should be.




Figure 3: The Priority syndromes worksheet of the iSIKHNAS data download.

Figure 3 The Priority syndromes.svg

Step 2: Data management

In this data management section we will:

  • Download iSIKHNAS data
  • Preserve data
  • Evaluate data
  • Create new data.
Downloading iSIKHNAS data

This is an important unstructured exercise as it will help you to know how to access iSIKHNAS data for your future work. See the video - Case study 1_data download.avi - for a demonstration of downloading data from iSIKHNAS.

The following steps are how to download iSIKHNAS data.

  1. Go to the website (www.isikhnas.com)
  2. Login with your username (cell number or email) and password
  3. Choose the data set (reports, disease, priority syndrome, then priority disease reports)
  4. Choose the date range you are interested in
  5. Choose the geographic area whose data you wish to examine
  6. Choose run report
  7. Scroll down and click view download and a csv sheet will download
  8. Save this and name it appropriately.

Before continuing the course, pause and practice downloading iSIKHNAS data of relevance to you.

We have downloaded the data you will need for the exercises in this course so we can provide you with exercise answers.

Preserving data

Whenever you download a data file you should save two separate versions of the same file.

One copy should be kept unchanged as an original copy of the data file. You will always have an original and unchanged copy of the data file as it was at the time you downloaded it.

The other copy can then be used as the working file for data analysis. During the analysis process you may make a variety of changes to the data including removing or adding data and variables.

A very good way to do this is to download the original data into a dedicated original data file (for example "Original data" folder in "My Documents"). An alternative practice is to save the original copy with the word _ORIGINAL added to the file name so you can easily identify it. Never work on original data files.

Give each iSIKHNAS download a meaningful name using an appropriate naming convention so you can tell downloads apart. Here we have named our data isikhnas_priority disease syndromes_March_2013.csv.

Having two separate files (working and original) means you can always go back to the original file at any time to start a new analysis or to check what changes have been made in the working copy. If you do go back to the original file to start another analysis, make another working copy of the original file to work on.

See Figure 4 where we have included a screenshot of a possible folder structure with a folder for original data which includes the downloaded Excel file and a separate folder for analysing the data that includes a separate copy of the downloaded data.

Figure 4: Possible folder structure to ensure that original downloaded data is preserved and that analyses occurs in a separate folder with copied data.

Figure 4 Possible folder structure.svg


Themes256.png

Notes: Key concepts about data errors

Evaluating data

Identifying errors and other possible problems in the data

Exercise 3: Question (count missing data entries using Excel)

Count the number of missing data points in the Tanggal diinvestigasi column of the isikhnas_priority disease syndromes_March_2013.csv. Hint: use the filter function in Excel to select only the blank cells (missing data), then count them, or use the countif function in Excel.

Hint: Watch this video for assistance.


Exercise 3: Answer (count missing data entries using Excel)

How many missing values did you count with the filter function?



How many missing values did you count with the "countif" formula?



Exercise 4: Question (selection and information bias).

In your groups, discuss what sort of selection and information bias may be present in the iSIKHNAS data.

Hints: Can you think of any syndromic investigations that may be routinely missing from the dataset? What effect would this have on conclusions based on the data that you do have? This is selection bias.

Can you think of any frequent errors that may occur when you are investigating a syndromic event and making a provisional diagnosis? This is information bias.

Exercise 4: Answer (selection and information bias).

Record some notes from the group discussion on how selection and information bias may occur.


Creating new data It is important to be able to create new columns of data (or variables) using existing columns. This will assist you in doing more detailed analyses of iSIKHNAS data. For example, you will need to do this if you want to determine the time between the report date and the investigation date. In the following exercise you will need to create a new column "jam untuk menyiasat" and populate it with the number hours between reporting and investigation.

Exercise 5: Question (creating new data: hours to investigation)

Create a new column that measure the time between reporting and investigation.

Steps (hint):

Change the format of the columns Tanggal laporan and Tanggal diinvestigasi in the isikhnas_priority disease syndromes_March_2013.csv worksheet to date format (if this is required).

Create a new column heading called time to investigate (masa untuk menyiasat).

Create a formula in the column that subtracts the Tanggal laporan date from the Tanggal diinvestigasi to determine the number of days between dates.

Multiply by 24 to determine the number of hours between reporting and investigation (jam untuk menyiasat).

Convert the answer to a new column of values using paste special (values).

Delete the extra columns, just leaving the jam untuk menyiasat column.

Delete the #NUM! error message using the filter function. These are missing values.

You have now created new data that you can use to investigate the time taken to investigate priority syndromes that allows you to assess the performance of the veterinary services.

Hint: Watch a video for a demonstration

Exercise 5: Answer (creating new data: hours to investigation)

Record in brief note form what you found difficult or easy about exercise 5.

This concludes the step 2: data management. We have considered selection and information bias, we have examined errors and we have created new data. Next, we will describe the data.

Step 3: Description of data

The next stage in analysing this data is to examine the jam untuk menyiasat column and describe it. To do this we will first describe the jam untuk menyiasat column with descriptive statistics and then we will plot the data on a graph. However, before we do this we will provide some notes on descriptive statistics.


Themes256.png

Notes: Key concepts for descriptive statistics



Describing a single variable

Exercise 6: Question (calculating measures of central tendency and dispersion of jam untuk menyiasat)

Above we have used 5 simulated values to demonstrate calculations of measures of central tendency and dispersion. Now we will calculate these same measures using real data from iSIKHNAS.

Calculate the mean, median, standard deviation, confidence interval, range and interquartile range for the variable jam untuk menyiasat in the Syndromas prioritas worksheet.

Hint:

Use Excel. Especially use the descriptive analyses tool in Excel.

Hint: See video for a demonstration. This video describes how to conduct the analysis using the Analysis ToolPak in Excel.

If you do not have access to the Analysis ToolPak then you may wish to view the following video on how to calculate the values using individual Excel formulas Watch video now.

Exercise 6: Answer (calculating measures of central tendency and dispersion of jam untuk menyiasat)

Write your answer to each of the parameters in the following table:

Exercise 6 Answer table.svg

Plotting

Now we have finished calculating the measures of central tendency and dispersion on jam untuk menyiasat.

There is an indication that the data is not normal. This is indicated because the mean (750 hours) and median (0 hours) are very different. It is important to look at the distribution of the data in order to be able to choose the most appropriate estimates. For example, will we pay attention to the mean or the median in this case?

To do this we will create a histogram of jam untuk menyiasat. This will give us an idea of the distribution of the data. We will do this in Excel in a moment as an exercise, but prior to this we will present the histogram and discuss it. Here is the histogram of jam untuk menyiasat (Figure 4).

Figure 5: A histogram of hours to investigate a priority syndrome disease report.

Figure 5 A histogram of hours to investigate a priority syndrome report.png

By examining the histogram we can see that the jam untuk menyiasat:

  • Ranges from 0 hours to greater than 3261.6 hours.
  • There are 43 priority syndromes where investigation was immediate (0 hours). This indicates that a lot of jam untuk menyiasat times were small. This indicates a good veterinary service.
  • There is a long tail where the time to investigate is long for some reports of a priority disease.
  • The distribution is not normal. The median and the interquartile range are much better descriptive measures than the mean and standard deviation or confidence interval.

Attention is required to see why some times to investigate are very long. These generally occurred in 2013 when iSIKHNAS was new and may indicate an early data recording problem or that investigation procedures have improved since the recording began. However, further efforts to determine the reasons for long investigation times are required.

Exercise 7: Question (histogram of jam untuk menyiasat in Excel)

Create a histogram of jam untuk menyiasat.

Hint: Watch the video for assistance.


Exercise 7: Answer (histogram)

Place the histogram you produced for the hours to investigate in the space below. You may draw it free hand, cut and paste it from the Excel worksheet into this document electronically or print it and physically paste it in to the space.








Step 4: Statistical hypothesis testing

The fourth and final step in data analyses is to conduct hypothesis testing. However, this is not relevant to this case study where we simply wished to examine the time taken to investigate a priority syndrome report. We will conduct hypothesis testing in the next two case studies.

Summary of case study 1

In this case study you first learnt how to download iSIKHNAS data.

You conducted several important steps in data analysis:

  1. Established an objective.
  2. Managed data (by backing up your original data, examining the data for errors and creating new data).
  3. Described data.

These three common steps should always be undertaken no matter what data you have.

It is important to note that sometimes the way you describe data (step 3) will differ. For example here you used measures of central tendency (e.g. a median) and measure of dispersion (Q1:Q3) to describe the data. This sort of description is only useful for a continuous variable. Other types of data will need different approaches to describe them (see Case study 2 where categorical data is described with contingency tables). Regardless of the type of data, you should always try to describe the data by plotting it on a graph.

In this case study it was not necessary to conduct the 4th step (hypothesis testing) as we simply wished to explore the time it took to investigate a priority disease report.

This ends case study 1.

Case study 2: Seasonal prevalence of diarrhoea (Mencret) in cattle

Introduction to case study

Diarrhoea is a clinical sign of many serious production diseases, for example internal parasites. Here we will examine the prevalence of diarrhoea in cattle throughout the year to determine if there is a seasonal peak in the amount of diarrhoea. This will allow you to plan for Mencret. For example we can identify if and when an annual peak of cases is observed each year. This may allow you to request anthelmintic supplies before the peak or implement a preventative plan in advance of the seasonal peak. The relevant data is ‘Cattle clinical signs.xls’.

It is important to note that the data used in this training course was downloaded in early 2014. This data was collected when iSIKHNAS had been operating for approximately a year in a limited area of Indonesia. The incomplete nature of our data means that no real conclusions can be made about the results of data-analyses conducted in this course. However, over time more complete data will be available for Indonesians to conduct better analyses.

Skills to be developed during this case study

  • Evaluation of data
  • Creation of new data
  • Describe data using contingency tables and plots
  • Hypothesis testing using odds ratios and confidence intervals

List of files for case study

Data

Cattle clinical signs.xlsx

Videos

Case study 2_exercise 10 video

Case study 2_exercise 11 video

Case study 2_exercise 12 video

Case study 2_exercise 13 video

Case study 2_exercise 14 video

Steps in analysis of Case study 2 (exercises)

Step 1: Objective

The objective in this case study is to determine whether there is a difference in prevalence of cattle with diarrhoea between the dry and wet season. We need this information so that we can potentially improve veterinary management.

For example we may wish to order medicines in time for a seasonal peak in diarrhoea. We may also wish to implement management programs at the correct time of year to reduce diarrhoea. To do this we have to know if there is a seasonal peak and if there is, when this occurs.

There are no exercises on objectives in this case study.

Step 2: Data management

The relevant dataset is Cattle clinical signs.xlsx.

The first step is to preserve your data. Make a copy to work on and archive the original file.

Next we will evaluate the data. We will decide which data is most relevant and we will check for problems in the data (errors and bias). We will then decide whether we need to create additional data to address the objective.

What data is important?

There are many columns of data (variables). We will start by thinking about which data columns are important in the Cattle clinical signs.csv. There are two sorts of data that we need:

  1. Some way of calculating the amount of disease. This implies we need both numerator data (the number of cases of diarrhoea cattle) and denominator data (the number of cattle at risk of disease)
  2. Change of disease over time. We need an idea of when observations of sick cattle occurred.

Exercise 8: Question (Determine which variables are important to address our objectives in the Cattle clinical signs.xlsx).

Hint:

Examine columns that represent the Tanda (signs) being reported and Diagnosa sementara (provisional diagnosis). Which do you think would be most useful to indicate the possible diagnosis of cattle with diarrhoea? Is there data that indicates when infection occurred? What denominator data is present?

Exercise 8: Answer (Determine which variables are important to address our objectives in the Cattle clinical signs.xlsx).

Write notes based on group discussions on which data columns are most important and why?

After you have discussed the issues in your groups, refer to the answers provided to assist your answering of this question. Write the answer here.


Identifying errors and other possible problems in the data

It is important to again determine whether there is any missing data and to think about whether there are any biases (selection or information) or errors in the data. We will do this as exercises as we did in in the first case study.

Exercise 9: Question (examine the data for errors and think about biases that may be present)

Hint:

Use the filter function in Excel to select only the blank cells (missing data), then count them, or use the countif function in Excel. Do this for the two variables of interest.

Whilst using the filter function, examine all the categories in the two columns of interest that were outlined in the exercise above. Are any categories unusual or potentially errors?

What would you suggest we do about the errors (if any were detected)?

Are there any biases likely in the data?

Exercise 9: Answer (examine the data for errors and think about biases that may be present)

Did you detect errors (yes/no)?

Write notes on group discussions of any biases present.



Creating new data to facilitate later analyses

In order to conduct analyses we will have to create some additional data. Do exercise 10 to create this data.

Exercise 10: Question (create new data)

We need to create some new data in order to analyse our data optimally. Think about which data is required.

Hint: We need to know whether the season is the wet or dry season and whether the clinical signs recorded are diarrhoea or other clinical signs. This will require the use of formulas to populate three new columns (variables), month of year, season and diarrhoea/other clinical sign. How would you do this in Excel and where would the information come from in the worksheet?

After you have discussed this, see the answers document for the general approach (and formulas) and watch the video on exercise 10.

Watch the video now.

Exercise 10: Answer (create new data)

Write notes on the video exercise if you would like to.


Step 3: Description of data

This stage is important to further check for errors and to begin to get an understanding of relationships between variables. This understanding will help us to answer our question of interest.

Descriptive of data

We will first examine and summarise each of the columns of interest. To do this will require Pivot tables in Excel. This will be done as an exercise.

Exercise 11: Question (summarise each of the columns of interest (month, season and diarrhoea?))

Establish 3 contingency tables, one for count of clinical signs (diarrhoea or otherwise, one for count of season (wet or dry) and another for count of month of the year.

Hint:

Each of the important data columns contains very simple data, either one of 12 months for the month integer column or wet/dry for Season and Diarrhoea/Other for the Clinical sign column. The previously used data analysis tool which calculated measures of central tendency and dispersion is not really relevant.

In this instance the pivot table option will allow you to summarise each of the categories. This will give you the number of observations in each of the categories (e.g. wet versus dry season).

Watch the video now.

Exercise 11: Answer (Summarise each of the columns of interest (month, season and diarrhoea?))

Insert your pivot table results into the section below.


Relationships between variables

Contingency tables

The next step is to begin to understand the relationships between different variables (e.g. clinical signs by season). This is done prior to formal hypothesis testing. In this instance the data we have is categorical (wet/dry season or month and diarrhoea/other). This suggests that contingency tables will again be very useful. Contingency tables subtotal categories of data and are ideal to calculate measures of association (later).

Exercise 12: Question (contingency tables)

Establish two contingency tables (pivot tables) in Excel and examine the data.

Hint:

The contingency tables should look as follows.

Exercise 12 Diarrhoea by season and month.svg

Hint:

Use the demonstration video to complete the exercise and populate the contingency tables.

Watch the video now.

Does there appear to be a relationship between season or month and diarrhoea?


Exercise 12: Answer (contingency tables)

Paste your two contingency tables below and examine the relationship between time of year and diarrhoea.



Completed contingency tables for discussion.

Below we have completed the contingency tables.

It appears that there may be a slight increase in the proportion of sick cattle that have diarrhoea in the wet season (27 verse 22%). It appears also that the prevalence of sick cattle with diarrhoea may increase during the year, with the middle dry season months having relatively low prevalence of diarrhoea.

However, results are very variable. Later we will use statistical hypothesis testing to allow more rigorous conclusions.

Exercise 12 Diarrhoea by season and month 2.svg

Plotting

It is very important to visualise the data in the form of plots since plots give a lot of information to the viewer.

This is important for the contingency table of month against diarrhoea. We will examine this data using scatterplots and lines of best fit in Excel. A scatterplot is used to plot data that has two variables associated with every observation. Usually, ordinal or continuous variables are required to plot data in a scatter plot.

Here, the prevalence of diarrhoea in all sick cattle presented for a month is one continuous variable (i.e. a percentage that varies from 0-100). We can plot months as an ordinal variable. This means we have an ordinal and continuous variable and we can therefore create a scatterplot with month on one axis (x axis) and prevalence of diarrhoea in sick cattle as another variable (y axis). This will allow us to examine the data and see visually whether the prevalence of diarrhoea cattle changes throughout the year.

Exercise 13: Question (plots of mencret prevalence by month)

Create a scatterplot of month against prevalence of diarrhoea in all sick cattle.

Hint:

Use the contingency table of month against Mencret above to populate the plot. Watch the demonstration video before attempting this.

Watch the video now.


Exercise 13: Answer (plots of mencret prevalence by month)

Include your scatterplot of month against prevalence of diarrhoea in all sick cattle below.




We have included the completed plot below (Figure 6). Some points to make on scatterplot design. The independent variable goes on the horizontal axis and the dependant variable on the vertical axis. The independent variable (month) is assumed to affect the dependant variable (prevalence). We need to include axis titles. We have included a line of best fit which here is a moving average. This plot helps by giving us more information about the relationship between season and prevalence of diarrhoea cattle. The line of best fit goes down in the dry season indicating there may be a relationship between diarrhoea and month of the year. The lowest prevalence appears to be in the dry season around months 4-8 with increasing prevalence in the wet season.


Figure 6

Figure 6.svg

Notes: Key concepts of Measures of Disease

Themes256.png

Step 4: Statistical hypothesis testing

Measure of association and hypothesis testing

There are two common ways of testing hypotheses using standard methods in veterinary epidemiology. This is with measures of association or statistical hypothesis testing. Here we will use a measure of association to answer our question. We will use a more traditional statistical hypothesis testing approach in the next case study (case study 3). In case study 3 we will also provide a comprehensive set of background knowledge notes to introduce the principals of statistical hypothesis testing.

In this case study we will establish a hypothesis, calculate a measure of association (and confidence interval) and interpret the results.

The question of interest here is whether there is a difference in diarrhoea prevalence between the wet or dry season. It is always important to first establish the null hypothesis. The null hypothesis is always the hypothesis of no effect. The null hypothesis is rejected or retained based on the value of the measure of association (and associated confidence interval).

Null hypothesis= Ho = There is no difference between the odds of diarrhoea between the wet and dry season.

The alternative hypothesis is generally opposite the null hypothesis. The alternative hypothesis is tentatively accepted if you have enough statistical evidence to reject the null hypothesis. Here the alternative hypothesis is:

HA = There is a difference between the odds of diarrhoea between the wet and dry season.

Exercise 14: Measure of association between Mencret and season.

Calculate a measure of association and confidence interval between diarrhoea and season.

Hint:

Calculate an odds ratio using the data from the two way contingency table of season and diarrhoea from Exercise 12.

We will be using Epitools for this exercise (http://epitools.ausvet.com.au/content.php?page=home). Please browse the Epitools website. Also, watch the video which details Epitools and how to use Epitools to calculate measures of association.

Watch the video now


Exercise 14: Measure of association between Mencret and season.

Place your epitools results here.



Here we will calculate the odds ratio by hand to assist you in interpreting the exercise. Here is the contingency table.


Dry Season
Wet Season
Diarrhoea
246
544
Other
875
1437
Total cattle
1121
1981
Prevalence of Mencret in sick cattle (%)
22
27

Odds of diarrhoea in the wet season:

544/1437 = 0.3786

Odds of diarrhoea in dry season:

246/875 =0.2811

Odds ratio:

0.3786/0.2811 = 1.35

We also know the 95% confidence interval is the interval 1.13 - 1.60. This means there is an association between season and diarrhoea. We thus reject the null hypothesis and infer that the alternative hypothesis is supported by the data: namely that season does have a role in the proportion of diarrhoea in passively surveyed cattle. However, it is important to note that this is simply a statistical association between the wet season and diarrhoea. It does not prove that the wet season causes diarrhoea. Instead it indicates that this may be possible. Further investigation and study would be required to determine that any factor actually causes disease.

Summary of case study 2

In this case study you again conducted several important steps in data analysis:

  1. An objective was presented to you
  2. You managed data (by backing up your original data, examining the data for errors and creating new data)
  3. You described data
  4. You conducted hypothesis testing using odds ratios and confidence intervals.

This time you described data by establishing contingency tables (called pivot tables in Excel) and by creating scatterplots and lines of best fit. These steps were almost enough to answer your research question of whether there was a change in diarrhoea between the wet and dry season. It appeared that there was a drop in diarrhoea in the dry season on the plot. We conducted some hypothesis testing using a measure of association and confidence interval to account for the effect of sampling variability so we could be sure what we saw in the tables and plots was most likely a real effect.


This is the end of case study 2.

Case study 3: Beef self-sufficiency (based on slaughter statistics)

Introduction to case study

The number, type and date of livestock slaughtered at abattoirs in Indonesia are recorded in iSIKHNAS. We will examine the cattle records.

We will determine whether there are changes in the number of different classes (e.g. female productive versus Australian imported cattle) of cattle slaughtered over the collection period of iSIKHNAS. We can make tentative conclusions about Indonesia’s quest for beef self-sufficiency based on changes in the slaughter of different categories of cattle.

It is important to note that the data used in this training course was downloaded in early 2014. This data was collected when iSIKHNAS had been operating for approximately a year in a limited area of Indonesia. The incomplete nature of our data means that no real conclusions can be made about the results of data-analyses conducted in this course. However, over time more complete data will be available for Indonesians to conduct better analyses.

Skills to be developed during this case study

  • Data management
  • Summarise data with plots
  • Understanding of null and alternative hypotheses
  • Hypothesis testing using a statistical test

List of files for case study

Data

Abattoir_April_2014.xlsx

Videos

Case 3 Exercise 16 video

Case 3 Exercise 17 video

Case 3 Exercise 18 video


Themes256.png

Notes: Key concepts for hypothesis testing and Case Study 3

Steps in analysis of Case study 3 (exercises)

Step 1: Objective

Exercise 15:

Break into small groups and discuss the objective of the case study. Do the hypotheses support the objective of the analyses?

Report back to the group with your conclusions.

Hint:

The broader focus of the analysis is to address an issue of concern for Indonesia, namely achievement of self-sufficiency in beef production. One way to understand whether self-sufficiency is increasing is to examine slaughter statistics.

Hypotheses are:

Null hypothesis: Month and slaughter categories are independent. That is, there is no effect of month on total slaughter numbers in different categories of slaughtered cattle.

Alternative hypothesis = Month and slaughter categories are dependant. That is, there is an effect of month on total slaughter numbers in different categories of slaughtered cattle.

Are these hypotheses adequate?

Exercise 15: Do the hypotheses support the objective of the analyses?

Write brief notes on group discussions of this topic.

Step 2: Data management

Back up your data before you alter it!

Exercise 16: Question (create a pivot table of cattle slaughter statistics by month).

Hint:

Select all the cattle data from the Excel worksheet titled Abattoir 2014.xls and save in a fresh work sheet. Open cattle slaughter data, error check and create a contingency table (pivot table in Excel).

Watch video now. Note you have already done some of this in previous exercises.

Exercise 16: Answer (create a pivot table of cattle slaughter statistics by month).

Include the pivot table you created below.


Step 3: Description of data

Exercise 17: Question (summary plots)

Do some summary plots for the contingency table. The aim is to understand the data before hypothesis testing.

Hint: Watch the video for assistance.


Exercise 17: Answer (summary plots)

Include the summary plots you created below.

Step 4: Statistical hypothesis testing

Exercise 18: Question (chi-squared analysis of slaughter data)

Conduct a chi-squared statistical test on the pivot table.

Hint: Watch this video now.


Exercise 18: Answer (chi-squared analysis of slaughter data)

Include your chi-squared test results below.


Exercise 19: Question (inference)

Break into your groups and discuss the findings of the significant chi-squared test. What does it mean? That is, make some inferences.

Hint:

Look at the result of the statistical test. Look at the summary plots from exercise 17 and look at the expected verse observed values from exercise 18. Where are the major differences between expected and observed values that are most contributing to the size of the chi-squared statistic? These are the areas where the effect is largest.


Exercise 19: Answer (inference)

Write brief notes on group discussions on inference about the significant chi-squared test result for the slaughter data.


Summary of case study 3

In this case study you were first presented with information on the basic statistical hypothesis testing approach. This can be complex, but is a consistent approach used in many studies. Then you again conducted several important steps in data analysis:

  1. The hypotheses were checked to see that they supported the objective
  2. Managed data (by backing up your original data, examining the data for errors and creating new data)
  3. Described data with contingency tables and plots
  4. Conducted hypothesis testing

This time you conducted hypothesis testing with a formal statistical approach using a chi-squared test. The major steps in the hypothesis testing approach can be used for most datasets. You now have the general skills to do this. With time and experience you will improve your knowledge of which statistical test to use and how to apply them.

This ends case study 3.

Concluding remarks

Data analysis is critical to good animal health management and policy formation. Indonesia is fortunate to have an excellent and newly functional animal health information system, iSIKHNAS. However, it is no use having one of the world's best information systems if no-one uses the data. Hence, this course has focused on helping you to begin to use iSIKHNAs data. These approaches may help you to improve animal health decision making.

There are several standard steps to data analysis, including developing a research question of interest (objective), data management, description and hypothesis testing. All of these steps were used in this course. In order to consolidate your new skills it is recommended that you apply these steps to questions of interest to you immediately upon your return to your workplace. For example if you spend a day or two per month for the next several months you will consolidate your new skills.

We have only had time to use a fraction of the available statistical approaches (e.g. measures of association and a chi-squared test). You may choose to expand your knowledge beyond what we have learnt in this course. There are a number of useful text books that can assist you in developing your statistical skills. Veterinary Epidemiologic Research (Dohoo et al., 2009) is a useful text book that covers veterinary epidemiology and presents some statistics in this field. Statistics for Veterinary and Animal Science (Petrie and Watson, 2006) is a useful text book that applies statistics more generally to veterinary science. If you decide that you really want to learn both R and expand your statistical knowledge you should complete the appendix on R. Introductory Statistics with R (Dalgaard, 2008) is a very useful text book that introduces both statistics and R. If you start using R, then the statistical world will be at your figure tips!

Appendix 1: Extension work using R.

Introduction

This appendix introduces the reader to R. It is for those course participants who do a substantial amount of statistical analyses (or aim to) and wish to improve their capability beyond what Excel allows. There may not be time to complete this appendix during the course, but it can be completed later.

R is a statistical and graphics software environment. It is widely recognised and used throughout the world. R provides a very powerful and flexible environment in which to conduct simple or complex statistics or to produce publication ready graphics. It is also free, and constantly updated. It is widely supported by a range of scientists who are constantly contributing new packages that expand the capability of R. R will be around for the long term, so any effort you invest in learning the software will be useful throughout your career. As there are so many avid users around the world there are many online help sites and discussion boards with free code and advice.

See http://www.r-project.org/ for further details.

Read the first chapter of the R-intro.pdf document to further expand your knowledge (6 pages).

In summary, if you want to do serious statistics, then you need a professional statistics package, not Excel. There are a number of software packages available (e.g. SAS, Stata and R), but R is one of the best and it is free. It is a little harder to use than Excel because you have to learn some scripting commands, but it is well worth the effort. We will start by telling you what software you require. We will then lead you through Case study 1 again. You can begin to use R.

Software to download

There are only two pieces of free software to download. These will give you the complete tools of a statistician! These are R and RStudio. Some of you may need to get your IT people to assist you to do this as you may not have the necessary permissions on your computers.

R

Please download R from a CRAN mirror, preferably either:


http://cran.repo.bppt.go.id/ Agency for The Application and Assessment of Technology
http://cran.unej.ac.id The University of Jember

Most of you will download the R for windows version, and this would be preferred unless you have a MAC. Once downloaded, run the file and install the program.

RStudio

RStudio is a development environment for R. One of the most useful features of RStudio is that you can create files of R code that you can store, archive and send easily to R for implementation. It makes writing and running R code easier. Essentially RStudio allows you to create an R code file, work on it with text editing capability and send it to R for running.

See http://www.rstudio.com/ide/ for further details.

Please download RStudio from http://www.rstudio.com/ide/download/.

Exercise

  1. Open R studio
  2. Watch the video - Introduction to R
  3. Repeat Case study 1 from this Data Analysis course, but in R (using R studio) instead of Excel.

Further development of R and statistical skills

Purchase an R introductory statistics book, read it and do the exercises. Introductory statistics with R by Peter Dalgaard (Dalgaard, 2008) is excellent (http://www.springer.com/statistics/computational+statistics/book/978-0-387-79053-4).

Complete reading "An introduction to R"

Browse the website Quick-R (http://www.statmethods.net/)

Practice and don't give up! You will improve and it will be worth the effort.

Appendix 2: Answers to exercises in data training manual

Case Study 1: Veterinary Performance indicators

Exercise 1: Examine the data

Break into groups of 2-4 people and conduct the following exercises. Place your answers in your workbooks.

Carefully examine the worksheet columns and the data contained within each of the columns.

What do each of the columns mean?

ID = A unique ID number.

Tanggal laporan = date of report = The date the report was received at the DINAS.

Jumlah kasus = number of cases= The number of cases of the syndrome.

Syndroma = One of several syndromes: Abortion and swollen joints, Biting and behavioural change, Fever in pigs, Limping, drooling and vesicles, Poultry mortality and Sudden death.

These are designed to detect several diseases for example HPAI, rabies, FMD, Brucellosis, Anthrax

Nama pengirim = senders name = The veterinarian who submitted the report.

Provinsi = the province in which the syndrome occurred.

Kabupaten= regency in which the syndrome occurred.

Kecematan= district?

Desa = village in which the syndrome occurred

Tanggal diinvestigasi = date investigated = the date when the syndrome was investigated

Staf teknis dinas = technical service staff that investigated the syndrome

Tipe investigasi = type of investigation.

Diagnosa sementara = provisional diagnosis = temporary diagnosis until further diagnostic work up.

How many priority syndrome reports are there in the worksheet?

There are 108 rows and hence 108 reports. The first row is a header row.

Think about the objective of the analyses we wish to conduct.

Which data columns do you think are particularly useful to address our objective? As a hint, two important columns are highlighted in Figure 3.

The two important columns are Tanggal laporan and Tanggal diinvestigasi. Together these columns can be used to determine the amount of time that elapsed between a report of a syndrome and an investigation.

Report your findings back to the larger group.

Exercise 2: Outline a detailed objective

Determine a more detailed analysis objective. Hint, what would the difference between the two highlighted date columns show? Use this to develop a specific objective for the analysis.

We need to analyse the difference between date of report and date of investigation of the priority syndrome. We could record this as the approximate number of hours between report and investigation.

Report back to the group with your refined objective.

The facilitator will eventually need to guide all participants to the same final objective. This is that we will analyse the difference between date of report and date of investigation of the priority syndrome recorded in hours. We will then describe this difference and determine whether there are differences between the time to investigate for rabies and abortion.

We will then infer whether the vet services are doing rapid enough investigations overall and whether there is a different level of urgency for rabies and abortions.

Exercise 3: Count missing data entries using Excel

Count the number of missing data points in the Tanggal diinvestigasi column of the isikhnas_priority disease syndromes_March_2013.csv. Hint: use the filter function in Excel to select only the blank cells (missing data), then count them, or use the countif function in Excel.

Answer: Two methods could be useful. A filter or a countif (function) method.

Filter method

Go to data, click the filter tool and then go to the Tanggal diinvestigasi column and click the filter arrow, then deselect all, scroll down and select blanks. This will leave only those rows that have no data in the column. These can then be counted. There are approximately 40.

Countif method.

Highlight a blank cell. Enter the following formula:

=COUNTIF(J2:J109,"")

Press enter and the number of blank cells will be counted.

The J2:J109 means examine all the cells in the Tanggal diinvestigasi column. This number will need to be changed if the data is changed from isikhnas_priority disease syndromes_March_2013.csv where there are only 108 entries/rows/disease investigations.

The " " refers to the fact that we are looking for a blank/missing data. That is, the quotation marks contain nothing.

See the associated training video for Case study 1, exercise 3.


Exercise 4: selection and information bias.

In your groups, discuss what sort of selection and information bias may be present in the iSIKHNAS data.

Answer:

Can you think of any syndromic investigations that may be routinely missing from the dataset? What effect would this have on conclusions based on the data that you do have? This is selection bias.

Example: You are from a province that has a very large commercial dairy industry. These producers use their own company veterinarians and do not call the state vets when they have an abortion problem. Hence you have selection bias where abortion syndromes are only recorded if they occur in small producers cattle. Your estimate of how much abortion is present is biased because you have no information on the amount of abortion in dairy cattle (and this prevalence is different to abortions in cattle of small producers). There may be more or less abortions present in the provinces cattle depending on whether the dairy industry has more or less abortions than the small producers.

Can you think of any frequent errors that may occur when you are investigating a syndromic event and making a provisional diagnosis? This is information bias.

Example: Assume rabies has never been diagnosed in your province before. When some Gila galak dogs are seen it is assumed that they have eaten a toxin and are intoxicated. This is information bias because the dogs were really infected by Rabies.

Exercise 5: Creating new data (time to investigation)

Hint: See the video for assistance.

Exercise 6: Calculating measures of central tendency and dispersion of jam untuk menyiasat

For the variable jam untuk menyiasat in the isikhnas_priority disease syndromes_March_2013.csv, calculate the mean, median, standard deviation, confidence interval, range and interquartile range.

Hint: Use the descriptive analyses tool in Excel.

Answer

To do these sorts of analyses easily in Excel, the data analysis option from the Analysis ToolPak is useful. First, you have to make sure this is loaded in Excel.

To do this follow the following path:

File>options>add ins>Analysis ToolPak>Go>Analysis ToolPak- check>OK

You will then see a tab on the data ribbon in Excel.

See the video for instructions.

You can then interpret these descriptive statistics as follows:

The mean is 750 hours and the median is 0 hours. These answers are very different and tell us that the data is skewed (or is not normally distributed). When the data is not normally distributed, the median is a better measure of central tendency. Hence it appears that most common time to investigation are 0 hours (this is very good and indicates a good veterinary service!).

Whilst a standard deviation (1234 hours) and confidence interval for the mean (±294 hours) are given these are not very useful because the data is skewed and they will not be very accurate. A better measure is the interquartile range. This is not provided by the data analysis tool and a separate method is required to calculate this.

It can be calculated with the quartile function in Excel.

Enter the following functions in an empty cell in the Syndromas prioritas worksheet. Do two calculations, one to find the 1st quartile, a second to find the 3rd quartile. Then you can subtract the 1st quartile from the third quartile to get the interquartile range.

The equation for the first quartile is:

=QUARTILE(N2:N109,1)

This means we want to calculate the 1st quartile from the data in the cells N2 to N109.

The equation for the third quartile is:

=QUARTILE(N2:N109,3)

This means we want to calculate the 3rd quartile from the data in the cells N2 to N109.

The answers are 0 and 1218 and so the IQR is 1218-0 = 1218 hours.

Thus 50% of the times to investigate are between 0 and 1218 hours around the median of 0 hours. The median time to investigate of 0 hours and a IQR of 1218 hours indicates a rapid time to investigate reports of priority syndromes, but that there is considerable variability.

This variability is worrying as if it was an important syndrome, then an important disease could have been spreading for a long time before it is investigated. This indicates there are a small proportion of veterinary investigations of priority diseases that take a very long time. However, these long periods of time between reporting and investigation have generally occurred in the earlier entries in the dataset and perhaps were just a temporary problem to do with data recording. In the more recent observations these long periods of time do not appear to be occurring. Further checking of the accuracy of the data would be required to determine the issue would be required (for example by speaking to the field investigators).

If you don't have the analysis ToolPak:

Enter these formulas into Excel individually to calculate each parameter separately.

Mean: =AVERAGE(N2:N109)

Standard deviation: =STDEV.S(N2:N109)

Confidence Interval: =CONFIDENCE(0.05,1233.55307,70)

Median: =MEDIAN(N2:N109)

Q1: =QUARTILE(N2:N109, 1)

Q3: =QUARTILE(N2:N109, 3)


Exercise 7: Histogram of jam untuk menyiasat in Excel

Create a histogram of jam untuk menyiasat.

See video for assistance.


Case study 2: Mencret cattle

Exercise 8: Determine which data are important to address our objectives in the Cattle clinical signs.csv.

Answer

Broadly we need to be able to determine the number of cases and when these occurred. Additionally in order to calculate a measure of disease frequency we also need to determine the population at risk. We also need to consider what sort of biases may be present in the data.

Number of cases:

There is one relevant column that could be considered to determine cases of diarrhoea in the population. This is the Tanda (signs) column. The diagnose sementara (provisional diagnosis) column is too detailed and does not allow all the diarrhoea cases to be identified easily.

Population at risk:

This data is gathered from farmer reports of disease and is thus passive surveillance data. Unfortunately passive surveillance data is always biased, and in this case it is very likely that any data we gather from iSIKHNAS will under-estimate the number of animals at risk and afflicted in the population. This is because some farmers will never report disease so we will miss many cases of disease and in addition farmers won't report healthy cattle! However, if we pretend this is all the data we have to estimate the amount of diarrhoea cattle in the population we are interested in, this is still useful data. It is useful, because although the data is biased and will under-estimate the number of cattle at risk, we can assume that the under-estimation is constant and does not change over time. If this is true we can still look at the change in the amount of diarrhoea in cattle over time accurately. The exact proportion of diarrhoea cattle may not be accurate but the relative change in the proportion of diarrhoea cattle will be accurate.

Thus the population at risk could be assumed to be all the reports of diseased cattle in the data set. However we know this population estimate it is biased and will underestimate the number of cattle in the population but it can allow us to determine how the proportion of diarrhoea cattle change in the reported cattle population over time.

Date of occurrence:

There are two date columns, the Tanggal laporan and the Tanggal diinvestigasi columns. Of these, the Tanggal laporan (date of report) column most accurately reflects the time of occurrence of the Mencret. (We know from case study 1 that there is sometimes a delay in investigating disease reports, even when they are priority syndromes.)

Exercise 9: Examine the data for errors and think about biases that may be present

Answer

Use the filter function and look for blanks in each of our two data columns (Tanda and Tanggal laporan). There appear to be no blanks.

Next look at the range of dates in the Tanggal laporan column. Examine the first and last dates in the data set. Are the dates reasonable? The range is from 1/5/2013 to the 16/04/2014 which does indeed appear reasonable.

Examine the categories of common signs reported in the Tanda column. Are these reasonable? There appear to be a few complexities. For example there are many diarrhoea entries, for example diarrhoea is listed as is diarrhoea w/blood and many other categories of diarrhoea.

However, generally the data appear good and there is no obvious reason why data analysis should not continue.

The main bias possible is selection bias. That is, farmers with healthy cattle will not report them to the authorities and hence the denominator will be under-estimated if examining prevalence of diarrhoea.

Exercise 10: New data

Answer

New month integer column

In order to analyse the data to see whether wet or dry season make a difference to the occurrence of diarrhoea we need to be able to categorise each report of a clinical sign into the wet or dry season. To do this will take several steps.

The date data is hard to use in Excel. We will therefore create a new column with an integer (whole number) corresponding to the number of the month in the year (i.e. months 1-12 with month 1 being January, 2 being February and Month 12 being December etc.).

Step 1: Date format

First check that Excel recognises the Tanggal laporan column as a column with date formatting.

Step 2: Date integer

To create the month integer column will require the Month formula (=Month(serial_number)). This formula returns an integer from 1-12 depending on the date in the referent cell (serial number). So name a new column "Month integer", fill in the month formula and drag this formula down the new column for the length of the entire dataset. For example, if the date data is in column B and you are in the first row below the heading the formula will read: =Month(B2). This will return 1 if the date in cell B2 is in January, 2 if the date in cell B2 is February etc.

Step 3: Wet or dry season categorisation

The next step is to create a new column "season" and populate this with a category of either "wet" or "dry" depending on whether the month of the date of report falls in the wet or dry season. To do this will require an ‘If’ formula. You may have covered these sorts of formulas in your previous Excel training. However, we have established a formula for you below. This is a nested formula which allows all the possible month integers that could be the wet season to be examined and Wet returned. So in this case if the month of the year is 1,2,3,4, 11 or 12 (January, February, March, April, November or December) then wet season is returned. Otherwise if none of the integers are in the wet season (May, June, July, August, September, October) then dry is returned. This categorisation will later allow us to compare the diarrhoea cases that occur in the wet and dry season.

=IF(C2=1,"Wet",IF(C2=2,"Wet", IF(C2=3,"Wet",IF(C2=4,"Wet",IF(C2=11,"Wet",IF(C2=12,"Wet","Dry"))))))

Drag this down the column to cover the entire data set so that every observation is given a wet or dry season classification.

Diarrhoea or other clinical sign:

We will also create a new column called ‘diarrhoea?’. This column will record whether the Tanda column has a diarrhoea sign or another type of sign. This will be useful for calculating prevalence later. Again we will use an If formula.

=IF(ISNUMBER(SEARCH("diarrhoea",G2)),"Diarrhoea","Other")

This formula actually says, if the cell on the row that records the Tanda (G2) contains the word "diarrhoea" anywhere in the cell, then record "Diarrhoea" in the new cell, otherwise, if G2 does not contain the word diarrhoea then record "Other" in the new cell.

Please see the video Case study 2_Exercise 10 for assistance in completing this exercise. You should achieve the following screenshot:

Answer Case Study 2 Ex 10.png

Exercise 11: Describing the new data columns with pivot tables.

Answer

As the data is categorical data (e.g. season, or month) it is not possible to use the data analysis tool we used in case study 1. Instead we need to use pivot tables to describe our data.

The video titled case study 2_exercise 11 describes the approach. However, the following tables should be produced in Excel using the data provided and the pivot table approach.

Answer Case Study 2 Ex 11.svg


Exercises 12-14:

Answers to exercise 12, 13 and 14 are provided in the video and in the manual.

Case study 3: Slaughter statistics and beef production self-sufficiency.

Exercise 15: Do the hypotheses support the objective of the analyses?

Answer to exercise 15 is subjective with no specific answer. The purpose of the exercise is to encourage the participants to think about the hypotheses and overall objective.

Exercise 16: Create a pivot table of cattle slaughter statistics by month.

Answer:

Provided as a video. But is also included here.


Answer Case Study 3 Ex 16.png

Exercise 17: Do some summary plots for the contingency table.

Answer to exercise 17 is provided as a video

However, the following charts/plots should be produced in the exercise:

Exercise 18: Conduct a chi-squared statistical test on the pivot table.

Answer:

Chi-squared statistic = 3599.6, degrees of freedom = 15, p<0.000000001.

The critical value for rejecting the null hypothesis is 25, so 3599.6 definitely exceeds this value!

Reject the null hypothesis and accept the alternative hypothesis (there is dependence between the numbers of cattle slaughtered by category of cattle and month slaughtered).

Exercise 19: Inferences

Answer:

The most important result from the chi-squared statistical test is that the chi-squared value is very large. The cut point value for a chi-squared distribution with 15 degrees of freedom is 25, and our value of 3600 definitely exceeds this! We are therefore very confident that our data is not chi-squared distributed, and is dependant.

We reject the null hypothesis and accept the alternative hypothesis:

HA = There is an effect of month on total slaughter numbers in different categories of slaughtered cattle.

We now examine the chi-squared values for each cell and the observed and expected values, as well as the plots to determine where the dependence is most obviously occurring in the contingency table and therefore what effect is occurring.

It is obvious that the largest chi-squared values are found in later slaughter months for both Australian cattle and Indonesian male cattle. This is very useful to know as it indicates perhaps that reliance on imported cattle is declining to some extent whilst reliance on Indonesian slaughter cattle is increasing. This effect tends to support the notion that Indonesia is increasing its self-sufficiency in beef production.

It is worth making a cautionary note about this effect. This data is on a short time frame (only 6 months). To be sure of any trends a much longer time period of observation would be required and a much more rigorous statistical examination would be required. Additionally, some of the effect observed is certainly due to increasing numbers of slaughtered cattle being reported as iSIKHNAS reporting increases over time.

References

Dalgaard, P., 2008. Introductory Statistics with R. Springer.

Dohoo, I., Martin, W., Stryhn, H., 2009. Veterinary Epidemiologic Research. VER Charlottetown.

Petrie, A., Watson, P., 2006. Statistics for Veterinary and Animal Science. Wiley.




  1. The Oxford dictionary definition of data is: ‘Facts and statistics collected together for reference or analysis’