# LEARN | Module 2

**Biostatistics**

The ability to interpret and understand biostatistics is vitally important in making informed medical decisions based upon evidence. This week, we will discuss the difference between numerical and categorical data, describe what variability is and how to measure it, and elucidate the difference between p-values and confidence intervals.

## Different types of data

### Categorical Data

One way in which data can be sorted is into **categories**. For example, we can sort people based upon their blood type: A, B, AB or O. Each category must be **mutually exclusive** – this means that one person cannot belong to both Group A and Group O at the same time.

- If these categories can be ranked/ordered, then we further describe it as "categorical **ordinal** data". For example, tumour stages can be ordered as: stage I, stage II, stage III or stage IV.
- If the order of categories does not matter, then we describe it as "categorical **nominal** data". For example, hair colour can be: red, brown, blonde, black, grey or other (there is no "best" hair colour).
- When there are only two possible categories, we describe this as "**dichotomous**".

### Numerical data

Numerical data means that the variables are **numbers**, and mathematical operations such as addition, subtraction and division can be performed in a logical way. For example, the different heights of people in a class represent numerical data: 164cm, 167cm, 182cm, 187cm and 190cm. These can be added together and then divided by the number of people to find the average. In comparison, the different genders of people in a class do not represent numerical data: male + female = ?? does not make sense!

- If the data comes in the form of whole numbers, it is described as "numerical **discrete**". For example, the number of children in a family can only be whole numbers: 0, 1, 2, 3, 4, etc.
- If the data contains fractions/decimals, it is described as "numerical **continuous**". For example, the different heights of students can be anything: 162.4cm, 162.56cm, 183.675cm, etc.
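The difference between the two kinds of data can be sketched in a few lines of Python. The blood types below are made up for illustration; the heights are the ones from the example above. Categories can only be counted, whereas numbers can be averaged.

```python
from collections import Counter

# Hypothetical class data, made up for illustration.
blood_types = ["A", "O", "B", "O", "AB", "A", "O"]  # categorical nominal
heights_cm = [164, 167, 182, 187, 190]              # numerical (heights from the text)

# Categorical data can be counted per category, but not added or averaged.
blood_type_counts = Counter(blood_types)

# Numerical data supports arithmetic, so an average is meaningful.
average_height_cm = sum(heights_cm) / len(heights_cm)

print(blood_type_counts["O"])  # 3
print(average_height_cm)       # 178.0
```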

### Converting data

Importantly, it is possible to convert from numerical data to categorical data. For example, a numerical scale of pain rating from 1-10 can be converted so that all scores between 1 and 5 become "low" and all scores between 6 and 10 become "high". Low vs high now represent categorical data. *You cannot convert back the other way* - try it!
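As a minimal sketch of this conversion, the function below applies the low/high cut-off described above. The function name and the example scores are assumptions chosen for illustration.

```python
def pain_category(score: int) -> str:
    """Convert a 1-10 pain score into the 'low'/'high' categories from the text."""
    if not 1 <= score <= 10:
        raise ValueError("pain score must be between 1 and 10")
    return "low" if score <= 5 else "high"


# Hypothetical scores, made up for illustration.
scores = [2, 5, 6, 9]
categories = [pain_category(s) for s in scores]
print(categories)  # ['low', 'low', 'high', 'high']
```

Scores of 2 and 5 both become "low", so the original number cannot be recovered from the category. This is why the conversion only works in one direction.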

## Variability

For a detailed discussion of the normal distribution and variability, please consult page 2 (61) of the detailed notes. Variability essentially refers to how spread out the values are: if the shortest person in the class is 140cm and the tallest is 198cm, then the variability would be very high!
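Two common ways to put a number on variability are the range (largest value minus smallest value) and the standard deviation (roughly, the typical distance of a value from the mean). The sketch below computes both for the five heights used earlier in this module; it uses the sample standard deviation, which is an assumption about how the detailed notes define it.

```python
from statistics import stdev

# Heights (cm) from the numerical data example earlier in this module.
heights_cm = [164, 167, 182, 187, 190]

# Range: distance between the tallest and shortest person.
height_range_cm = max(heights_cm) - min(heights_cm)

# Sample standard deviation (divides by n - 1).
height_sd_cm = stdev(heights_cm)

print(height_range_cm)         # 26
print(round(height_sd_cm, 1))  # 11.8
```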

## p-values and Confidence Intervals

### The p-value

The p-value essentially tells us how easily the result obtained could have arisen **merely by chance**. More precisely, it is the probability of observing a relationship at least as strong as the one found if the "null hypothesis" is true (null hypothesis = there is actually no relationship between the two phenomena measured). If the p-value is 0.2, then a result at least this extreme would be expected in 20% of studies even if there were no real relationship. Note that this is not the same as saying there is a 20% probability that our result was only due to luck, which is a common misreading.
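To make the definition concrete, here is a minimal sketch using a made-up example: a coin is tossed 10 times and lands heads 8 times. The null hypothesis is that the coin is fair. The function name is hypothetical, and the exact two-sided binomial test used here is just one of many statistical tests.

```python
from math import comb


def fair_coin_p_value(heads: int, tosses: int) -> float:
    """Exact two-sided p-value under the null hypothesis of a fair coin.

    Adds up the probability of every outcome that is at least as far
    from the expected number of heads (tosses / 2) as the observed one.
    """
    expected = tosses / 2
    observed_distance = abs(heads - expected)
    extreme_outcomes = sum(
        comb(tosses, k)
        for k in range(tosses + 1)
        if abs(k - expected) >= observed_distance
    )
    return extreme_outcomes / 2 ** tosses


# Hypothetical experiment: 8 heads in 10 tosses.
p_value = fair_coin_p_value(8, 10)
print(p_value)  # 0.109375
```

A fair coin would produce a result this lopsided (8 or more heads, or 8 or more tails) in about 11% of experiments, so 8 heads out of 10 is not strong evidence that the coin is biased.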


### Confidence Intervals

Studies will often present results in terms of the mean (average) of values. This is called the **sample** mean (what is observed). If one were to measure the value in every person in the entire population and find the average, this would be called the **population** mean (what is real).

The confidence interval is a range of plausible values for the population mean, calculated from the sample. If the 95% confidence interval is 0.30 to 1.10, this means that the method used to calculate it captures the population mean in 95% of repeated studies; loosely speaking, we can be 95% confident that the population mean is between these two numbers. It is impossible for studies to include the entire population, so it is better to look at the reported confidence interval, rather than only the sample mean.

- For example, imagine a study investigating the height of students at UPSM. The researchers only choose 10 people to measure, and they all happen to be quite tall! The average height of these 10 people is 189cm. The researchers will then present their data in the following way: average height = 189cm, 95% confidence interval = 162cm-201cm. This means that, even though they found the average height of their participants to be 189cm, they aren't sure if that is the average height of the whole school! They are, however, 95% confident that the average height of the whole school is somewhere between 162cm and 201cm.

- If the study was repeated, a different sample mean and confidence interval would be obtained. If 100 studies were completed, then about 95 of their confidence intervals would be expected to contain the population mean.
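The idea of repeating the study many times can be checked with a small simulation. The sketch below is built on stated assumptions: heights are normally distributed with a made-up population mean of 170cm and standard deviation of 10cm, and the standard deviation is treated as known so that a simple z-based interval can be used. Real studies usually estimate the standard deviation from the sample and use a t-based interval instead.

```python
import random
from statistics import NormalDist, mean


def ci_coverage(population_mean=170.0, population_sd=10.0,
                sample_size=10, n_studies=1000, seed=0):
    """Fraction of simulated studies whose 95% CI contains the population mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval
    half_width = z * population_sd / sample_size ** 0.5
    hits = 0
    for _ in range(n_studies):
        # One simulated study: measure `sample_size` people.
        sample = [rng.gauss(population_mean, population_sd)
                  for _ in range(sample_size)]
        sample_mean = mean(sample)
        # Does this study's interval contain the true population mean?
        if sample_mean - half_width <= population_mean <= sample_mean + half_width:
            hits += 1
    return hits / n_studies


coverage = ci_coverage()
print(coverage)  # close to 0.95
```

Each simulated study gives a different sample mean and a different interval, but roughly 95% of those intervals contain the population mean.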

Further information is available in the detailed notes, along with information on different types of statistical tests that are used.

## Further reading

- Different types of data (YouTube video)
- Different types of statistical tests (YouTube video)
- Whitley E, Ball J. Statistics review 3: Hypothesis testing and P values. Critical Care 2002;6:222-225.
- Spriestersbach A, et al. Descriptive statistics. Deutsches Ärzteblatt International 2009;106:578–583.