Daftar isi
Extension: Using R, statistical and graphics software.
Introduction
This appendix introduces the reader to R. It is for those course participants who do a substantial amount of statistical analyses (or aim to) and wish to improve their capability beyond what Excel allows. There may not be time to complete this appendix during the course, but it can be completed later.
R is a statistical and graphics software environment. It is widely recognised and used throughout the world. R provides a very powerful and flexible environment in which to conduct simple or complex statistics or to produce publication ready graphics. It is also free, and constantly updated. It is widely supported by a range of scientists who are constantly contributing new packages that expand the capability of R. R will be around for the long term, so any effort you invest in learning the software will be useful throughout your career. As there are so many avid users around the world there are many online help sites and discussion boards with free code and advice.
See http://www.r-project.org/ for further details.
Read the first chapter of the R-intro.pdf document to further expand your knowledge (6 pages).
In summary, if you want to do serious statistics, then you need a professional statistics package, not Excel. There are a number of software packages available (e.g. SAS, Stata and R), but R is one of the best and it is free. It is a little harder to use than Excel because you have to learn some scripting commands, but it is well worth the effort. We will start by telling you what software you require. We will then lead you through Case study 1 again. You can begin to use R.
Software to download
There are only two pieces of free software to download. These will give you the complete tools of a statistician! These are R and RStudio. Some of you may need to get your IT people to assist you to do this as you may not have the necessary permissions on your computers.
R
Please download R from a CRAN mirror, preferably either:
http://cran.repo.bppt.go.id/ | Agency for The Application and Assessment of Technology |
http://cran.unej.ac.id | The University of Jember |
Most of you will download the R for windows version, and this would be preferred unless you have a MAC. Once downloaded, run the file and install the program.
RStudio
RStudio is a development environment for R. One of the most useful features of RStudio is that you can create files of R code that you can store, archive and send easily to R for implementation. It makes writing and running R code easier. Essentially RStudio allows you to create an R code file, work on it with text editing capability and send it to R for running.
See http://www.rstudio.com/ide/ for further details.
Please download RStudio from http://www.rstudio.com/ide/download/.
Exercise
- Open R studio
- Watch the video - Introduction to R
- Repeat Case study 1 from this Data Analysis course, but in R (using R studio) instead of Excel.
Further development of R and statistical skills
Purchase an R introductory statistics book, read it and do the exercises. Introductory statistics with R by Peter Dalgaard (Dalgaard, 2008) is excellent (http://www.springer.com/statistics/computational+statistics/book/978-0-387-79053-4).
Complete reading "An introduction to R"
Browse the website Quick-R (http://www.statmethods.net/)
Practice and don't give up! You will improve and it will be worth the effort.
Appendix 2: Answers to exercises in data training manual
Case Study 1: Veterinary Performance indicators
Exercise 1: Examine the data
Break into groups of 2-4 people and conduct the following exercises. Place your answers in your workbooks.
Carefully examine the worksheet columns and the data contained within each of the columns.
What do each of the columns mean?
ID = A unique ID number.
Tanggal laporan = date of report = The date the report was received at the DINAS.
Jumlah kasus = number of cases= The number of cases of the syndrome.
Syndroma = One of several syndromes: Abortion and swollen joints, Biting and behavioural change, Fever in pigs, Limping, drooling and vesicles, Poultry mortality and Sudden death.
These are designed to detect several diseases for example HPAI, rabies, FMD, Brucellosis, Anthrax
Nama pengirim = senders name = The veterinarian who submitted the report.
Provinsi = the province in which the syndrome occurred.
Kabupaten= regency in which the syndrome occurred.
Kecematan= district?
Desa = village in which the syndrome occurred
Tanggal diinvestigasi = date investigated = the date when the syndrome was investigated
Staf teknis dinas = technical service staff that investigated the syndrome
Tipe investigasi = type of investigation.
Diagnosa sementara = provisional diagnosis = temporary diagnosis until further diagnostic work up.
How many priority syndrome reports are there in the worksheet?
There are 108 rows and hence 108 reports. The first row is a header row.
Think about the objective of the analyses we wish to conduct.
Which data columns do you think are particularly useful to address our objective? As a hint, two important columns are highlighted in Figure 3.
The two important columns are Tanggal laporan and Tanggal diinvestigasi. Together these columns can be used to determine the amount of time that elapsed between a report of a syndrome and an investigation.
Report your findings back to the larger group.
Exercise 2: Outline a detailed objective
Determine a more detailed analysis objective. Hint, what would the difference between the two highlighted date columns show? Use this to develop a specific objective for the analysis.
We need to analyse the difference between date of report and date of investigation of the priority syndrome. We could record this as the approximate number of hours between report and investigation.
Report back to the group with your refined objective.
The facilitator will eventually need to guide all participants to the same final objective. This is that we will analyse the difference between date of report and date of investigation of the priority syndrome recorded in hours. We will then describe this difference and determine whether there are differences between the time to investigate for rabies and abortion.
We will then infer whether the vet services are doing rapid enough investigations overall and whether there is a different level of urgency for rabies and abortions.
Exercise 3: Count missing data entries using Excel
Count the number of missing data points in the Tanggal diinvestigasi column of the isikhnas_priority disease syndromes_March_2013.csv. Hint: use the filter function in Excel to select only the blank cells (missing data), then count them, or use the countif function in Excel.
Answer: Two methods could be useful. A filter or a countif (function) method.
Filter method
Go to data, click the filter tool and then go to the Tanggal diinvestigasi column and click the filter arrow, then deselect all, scroll down and select blanks. This will leave only those rows that have no data in the column. These can then be counted. There are approximately 40.
Countif method.
Highlight a blank cell. Enter the following formula:
=COUNTIF(J2:J109,"")
Press enter and the number of blank cells will be counted.
The J2:J109 means examine all the cells in the Tanggal diinvestigasi column. This number will need to be changed if the data is changed from isikhnas_priority disease syndromes_March_2013.csv where there are only 108 entries/rows/disease investigations.
The " " refers to the fact that we are looking for a blank/missing data. That is, the quotation marks contain nothing.
See the associated training video for Case study 1, exercise 3.
Exercise 4: selection and information bias.
In your groups, discuss what sort of selection and information bias may be present in the iSIKHNAS data.
Answer:
Can you think of any syndromic investigations that may be routinely missing from the dataset? What effect would this have on conclusions based on the data that you do have? This is selection bias.
Example: You are from a province that has a very large commercial dairy industry. These producers use their own company veterinarians and do not call the state vets when they have an abortion problem. Hence you have selection bias where abortion syndromes are only recorded if they occur in small producers cattle. Your estimate of how much abortion is present is biased because you have no information on the amount of abortion in dairy cattle (and this prevalence is different to abortions in cattle of small producers). There may be more or less abortions present in the provinces cattle depending on whether the dairy industry has more or less abortions than the small producers.
Can you think of any frequent errors that may occur when you are investigating a syndromic event and making a provisional diagnosis? This is information bias.
Example: Assume rabies has never been diagnosed in your province before. When some Gila galak dogs are seen it is assumed that they have eaten a toxin and are intoxicated. This is information bias because the dogs were really infected by Rabies.
Exercise 5: Creating new data (time to investigation)
Hint: See the video for assistance.
Exercise 6: Calculating measures of central tendency and dispersion of jam untuk menyiasat
For the variable jam untuk menyiasat in the isikhnas_priority disease syndromes_March_2013.csv, calculate the mean, median, standard deviation, confidence interval, range and interquartile range.
Hint: Use the descriptive analyses tool in Excel.
Answer
To do these sorts of analyses easily in Excel, the data analysis option from the Analysis ToolPak is useful. First, you have to make sure this is loaded in Excel.
To do this follow the following path:
File>options>add ins>Analysis ToolPak>Go>Analysis ToolPak- check>OK
You will then see a tab on the data ribbon in Excel.
See the video for instructions.
You can then interpret these descriptive statistics as follows:
The mean is 750 hours and the median is 0 hours. These answers are very different and tell us that the data is skewed (or is not normally distributed). When the data is not normally distributed, the median is a better measure of central tendency. Hence it appears that most common time to investigation are 0 hours (this is very good and indicates a good veterinary service!).
Whilst a standard deviation (1234 hours) and confidence interval for the mean (±294 hours) are given these are not very useful because the data is skewed and they will not be very accurate. A better measure is the interquartile range. This is not provided by the data analysis tool and a separate method is required to calculate this.
It can be calculated with the quartile function in Excel.
Enter the following functions in an empty cell in the Syndromas prioritas worksheet. Do two calculations, one to find the 1st quartile, a second to find the 3rd quartile. Then you can subtract the 1st quartile from the third quartile to get the interquartile range.
The equation for the first quartile is:
=QUARTILE(N2:N109,1)
This means we want to calculate the 1st quartile from the data in the cells N2 to N109.
The equation for the third quartile is:
=QUARTILE(N2:N109,3)
This means we want to calculate the 3rd quartile from the data in the cells N2 to N109.
The answers are 0 and 1218 and so the IQR is 1218-0 = 1218 hours.
Thus 50% of the times to investigate are between 0 and 1218 hours around the median of 0 hours. The median time to investigate of 0 hours and a IQR of 1218 hours indicates a rapid time to investigate reports of priority syndromes, but that there is considerable variability.
This variability is worrying as if it was an important syndrome, then an important disease could have been spreading for a long time before it is investigated. This indicates there are a small proportion of veterinary investigations of priority diseases that take a very long time. However, these long periods of time between reporting and investigation have generally occurred in the earlier entries in the dataset and perhaps were just a temporary problem to do with data recording. In the more recent observations these long periods of time do not appear to be occurring. Further checking of the accuracy of the data would be required to determine the issue would be required (for example by speaking to the field investigators).
If you don't have the analysis ToolPak:
Enter these formulas into Excel individually to calculate each parameter separately.
Mean: =AVERAGE(N2:N109)
Standard deviation: =STDEV.S(N2:N109)
Confidence Interval: =CONFIDENCE(0.05,1233.55307,70)
Median: =MEDIAN(N2:N109)
Q1: =QUARTILE(N2:N109, 1)
Q3: =QUARTILE(N2:N109, 3)
Exercise 7: Histogram of jam untuk menyiasat in Excel
Create a histogram of jam untuk menyiasat.
See video for assistance.
Case study 2: Mencret cattle
Exercise 8: Determine which data are important to address our objectives in the Cattle clinical signs.csv.
Answer
Broadly we need to be able to determine the number of cases and when these occurred. Additionally in order to calculate a measure of disease frequency we also need to determine the population at risk. We also need to consider what sort of biases may be present in the data.
Number of cases:
There is one relevant column that could be considered to determine cases of diarrhoea in the population. This is the Tanda (signs) column. The diagnose sementara (provisional diagnosis) column is too detailed and does not allow all the diarrhoea cases to be identified easily.
Population at risk:
This data is gathered from farmer reports of disease and is thus passive surveillance data. Unfortunately passive surveillance data is always biased, and in this case it is very likely that any data we gather from iSIKHNAS will under-estimate the number of animals at risk and afflicted in the population. This is because some farmers will never report disease so we will miss many cases of disease and in addition farmers won't report healthy cattle! However, if we pretend this is all the data we have to estimate the amount of diarrhoea cattle in the population we are interested in, this is still useful data. It is useful, because although the data is biased and will under-estimate the number of cattle at risk, we can assume that the under-estimation is constant and does not change over time. If this is true we can still look at the change in the amount of diarrhoea in cattle over time accurately. The exact proportion of diarrhoea cattle may not be accurate but the relative change in the proportion of diarrhoea cattle will be accurate.
Thus the population at risk could be assumed to be all the reports of diseased cattle in the data set. However we know this population estimate it is biased and will underestimate the number of cattle in the population but it can allow us to determine how the proportion of diarrhoea cattle change in the reported cattle population over time.
Date of occurrence:
There are two date columns, the Tanggal laporan and the Tanggal diinvestigasi columns. Of these, the Tanggal laporan (date of report) column most accurately reflects the time of occurrence of the Mencret. (We know from case study 1 that there is sometimes a delay in investigating disease reports, even when they are priority syndromes.)
Exercise 9: Examine the data for errors and think about biases that may be present
Answer
Use the filter function and look for blanks in each of our two data columns (Tanda and Tanggal laporan). There appear to be no blanks.
Next look at the range of dates in the Tanggal laporan column. Examine the first and last dates in the data set. Are the dates reasonable? The range is from 1/5/2013 to the 16/04/2014 which does indeed appear reasonable.
Examine the categories of common signs reported in the Tanda column. Are these reasonable? There appear to be a few complexities. For example there are many diarrhoea entries, for example diarrhoea is listed as is diarrhoea w/blood and many other categories of diarrhoea.
However, generally the data appear good and there is no obvious reason why data analysis should not continue.
The main bias possible is selection bias. That is, farmers with healthy cattle will not report them to the authorities and hence the denominator will be under-estimated if examining prevalence of diarrhoea.
Exercise 10: New data
Answer
New month integer column
In order to analyse the data to see whether wet or dry season make a difference to the occurrence of diarrhoea we need to be able to categorise each report of a clinical sign into the wet or dry season. To do this will take several steps.
The date data is hard to use in Excel. We will therefore create a new column with an integer (whole number) corresponding to the number of the month in the year (i.e. months 1-12 with month 1 being January, 2 being February and Month 12 being December etc.).
Step 1: Date format
First check that Excel recognises the Tanggal laporan column as a column with date formatting.
Step 2: Date integer
To create the month integer column will require the Month formula (=Month(serial_number)). This formula returns an integer from 1-12 depending on the date in the referent cell (serial number). So name a new column "Month integer", fill in the month formula and drag this formula down the new column for the length of the entire dataset. For example, if the date data is in column B and you are in the first row below the heading the formula will read: =Month(B2). This will return 1 if the date in cell B2 is in January, 2 if the date in cell B2 is February etc.
Step 3: Wet or dry season categorisation
The next step is to create a new column "season" and populate this with a category of either "wet" or "dry" depending on whether the month of the date of report falls in the wet or dry season. To do this will require an ‘If’ formula. You may have covered these sorts of formulas in your previous Excel training. However, we have established a formula for you below. This is a nested formula which allows all the possible month integers that could be the wet season to be examined and Wet returned. So in this case if the month of the year is 1,2,3,4, 11 or 12 (January, February, March, April, November or December) then wet season is returned. Otherwise if none of the integers are in the wet season (May, June, July, August, September, October) then dry is returned. This categorisation will later allow us to compare the diarrhoea cases that occur in the wet and dry season.
=IF(C2=1,"Wet",IF(C2=2,"Wet", IF(C2=3,"Wet",IF(C2=4,"Wet",IF(C2=11,"Wet",IF(C2=12,"Wet","Dry"))))))
Drag this down the column to cover the entire data set so that every observation is given a wet or dry season classification.
Diarrhoea or other clinical sign:
We will also create a new column called ‘diarrhoea?’. This column will record whether the Tanda column has a diarrhoea sign or another type of sign. This will be useful for calculating prevalence later. Again we will use an If formula.
=IF(ISNUMBER(SEARCH("diarrhoea",G2)),"Diarrhoea","Other")
This formula actually says, if the cell on the row that records the Tanda (G2) contains the word "diarrhoea" anywhere in the cell, then record "Diarrhoea" in the new cell, otherwise, if G2 does not contain the word diarrhoea then record "Other" in the new cell.
Please see the video Case study 2_Exercise 10 for assistance in completing this exercise. You should achieve the following screenshot:
Exercise 11: Describing the new data columns with pivot tables.
Answer
As the data is categorical data (e.g. season, or month) it is not possible to use the data analysis tool we used in case study 1. Instead we need to use pivot tables to describe our data.
The video titled case study 2_exercise 11 describes the approach. However, the following tables should be produced in Excel using the data provided and the pivot table approach.
Exercises 12-14:
Answers to exercise 12, 13 and 14 are provided in the video and in the manual.
Case study 3: Slaughter statistics and beef production self-sufficiency.
Exercise 15: Do the hypotheses support the objective of the analyses?
Answer to exercise 15 is subjective with no specific answer. The purpose of the exercise is to encourage the participants to think about the hypotheses and overall objective.
Exercise 16: Create a pivot table of cattle slaughter statistics by month.
Answer:
Provided as a video. But is also included here.
Exercise 17: Do some summary plots for the contingency table.
Answer to exercise 17 is provided as a video
However, the following charts/plots should be produced in the exercise:
Exercise 18: Conduct a chi-squared statistical test on the pivot table.
Answer:
Chi-squared statistic = 3599.6, degrees of freedom = 15, p<0.000000001.
The critical value for rejecting the null hypothesis is 25, so 3599.6 definitely exceeds this value!
Reject the null hypothesis and accept the alternative hypothesis (there is dependence between the numbers of cattle slaughtered by category of cattle and month slaughtered).
Exercise 19: Inferences
Answer:
The most important result from the chi-squared statistical test is that the chi-squared value is very large. The cut point value for a chi-squared distribution with 15 degrees of freedom is 25, and our value of 3600 definitely exceeds this! We are therefore very confident that our data is not chi-squared distributed, and is dependant.
We reject the null hypothesis and accept the alternative hypothesis:
HA = There is an effect of month on total slaughter numbers in different categories of slaughtered cattle.
We now examine the chi-squared values for each cell and the observed and expected values, as well as the plots to determine where the dependence is most obviously occurring in the contingency table and therefore what effect is occurring.
It is obvious that the largest chi-squared values are found in later slaughter months for both Australian cattle and Indonesian male cattle. This is very useful to know as it indicates perhaps that reliance on imported cattle is declining to some extent whilst reliance on Indonesian slaughter cattle is increasing. This effect tends to support the notion that Indonesia is increasing its self-sufficiency in beef production.
It is worth making a cautionary note about this effect. This data is on a short time frame (only 6 months). To be sure of any trends a much longer time period of observation would be required and a much more rigorous statistical examination would be required. Additionally, some of the effect observed is certainly due to increasing numbers of slaughtered cattle being reported as iSIKHNAS reporting increases over time.