I’m going to let you in on a little (maybe not so) secret: toxicology is in a crisis. A crisis that really calls a lot of the science into question.

What kind of crisis, you ask? A reproducibility crisis. That means we have to be careful about which toxicology studies we trust, because either the results definitively cannot be reproduced (someone tried and failed), or they likely cannot be replicated.

There are several facets driving this reproducibility crisis in toxicology.

One of the drivers of this crisis is the fact that scholarly toxicology journal editors and peer-reviewers rely heavily on p-values. A p-value is a complicated and misunderstood statistic that serves as a gatekeeper for whether or not a study will be published. If you want to know more about what a p-value is, and why p-values are a problem, see my article at Towards Data Science.

However, p-values aren’t the whole story. The bigger driver is actually the small sample sizes in the studies, which lead to something we biostats folks call “sampling error”. Let’s look at how small sample sizes can lead to sampling error in toxicology studies.

#### Small Studies = More Likely False Results

When we do pharmacology and toxicology studies, we generally want to know how a drug, chemical, food (contact) ingredient, or contaminant will impact a population. That population might be people, livestock, or animals in the environment.

We can’t just round up all the people in the world and expose them to a chemical (for practical reasons certainly, but also ethical concerns). So instead, we either test drugs in a smaller number of people, or we test them in a smaller number of animals.

In toxicology, even if we’re talking about drugs, we always test the drugs/chemicals in animals first, typically rodents, and then sometimes in some non-rodent species.

So we randomly sample these animals from the larger population of laboratory animals (typically from a vendor, though some laboratories have their own large animal colonies).

So how does sampling work? To illustrate this we’re going to turn to an urn (or a bucket, a pouch, a container of some kind) and balls — like lottery balls (in statistics, we visualize a lot of our theoretical challenges as urns and balls). Another way to think of this is a bingo ball cage. Turn the cage with the crank to shuffle the balls around, pull one out and that’s your first human/animal sample! Turn the crank again, that’s your next sample! Each time you pull a ball representing a person/animal out of the bingo ball cage, you assign them to a group.

So, turn the crank, pull out a ball — B12. Okay, B12 is now assigned to the placebo/control group. Turn the crank again, pull out another ball — G5. Okay, G5, you’re assigned to the low dose group, congratulations! And you keep doing that until you have assigned everyone to a group.
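The bingo cage can be imitated in software. Here is a minimal sketch, assuming twelve made-up animal IDs and three groups (the “high dose” group, the IDs, and the `assign_groups` helper are all my own illustration, not something from a particular laboratory’s procedure). Shuffling the list plays the role of turning the crank, and dealing the shuffled IDs out in turn keeps the groups the same size.

```python
import random

def assign_groups(subject_ids, group_names, seed=None):
    """Randomly assign subjects to groups of (nearly) equal size."""
    rng = random.Random(seed)      # a seed makes the assignment repeatable for audits
    shuffled = list(subject_ids)   # copy so the caller's list is left untouched
    rng.shuffle(shuffled)          # the "turn the crank" step
    groups = {name: [] for name in group_names}
    # Deal the shuffled subjects out to the groups one at a time.
    for i, subject in enumerate(shuffled):
        groups[group_names[i % len(group_names)]].append(subject)
    return groups

# Hypothetical animal IDs and group names, for illustration only.
animals = [f"A{n:02d}" for n in range(1, 13)]
groups = assign_groups(animals, ["control", "low dose", "high dose"], seed=42)
for name, members in groups.items():
    print(name, members)
```

Recording the seed alongside the study records means anyone can rerun the assignment later and confirm that the groups were not hand-picked.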

The idea here is that if we randomly sample from the population, then we should be able to estimate something about the population using just the animals we sampled.

A good example of this is height in humans. Not every person is the same height. Some people are very tall. Some people are really short. Most people fall somewhere in between. We sometimes say that height follows a “normal”, “Gaussian”, or “bell-shaped” distribution. That means most of the population has a height near the average (mean) or median, with progressively fewer people the further you move toward the really short or really tall ends (see image below).

#### Small Sample Sizes = Big Trouble

So, we’ve covered the fact that in animal studies we’re sampling from a larger population. We’ve talked about the fact that the goal of sampling is to end up with a sample that is representative of the population’s distribution. And in an ideal world, scientists are assigning people/animals to different experimental groups randomly. Ideally, they would have coded their laboratory animals/people, put those codes on the balls, put the balls into a bingo cage, turned the crank, and pulled out balls until no balls are left.

We don’t live in an ideal world. At no time have I ever seen a laboratory actually use a bingo ball cage. I have seen some groups use software to randomize subjects (animals/humans) into groups. But I have seen far more laboratory scientists assign their animals by grabbing them from a cage and assigning groups. And no, grabbing animals out of a cage is not random. As an aside, we can normally tell when animals are “random grabs” from a cage because the groups are typically stratified by weight. I haven’t published that observation in a peer-reviewed journal, but based on my years of quality assurance audits, “random grabs” are typically associated with different groups having different starting weights.

In the field of toxicogenomics (where scientists are measuring lots of RNA, proteins, or metabolites), it’s rather common to see studies where each experimental group has only 3 animals (actually, 2 animals is also common). The justification typically given for 3 animals is that the toxicogenomic technology is so expensive that the laboratory couldn’t possibly afford to use more animals.

So what happens when we sample using 3 replicates?

Look at the figure above. The histogram is the population distribution. Each line segment is a different sampling group, where I grabbed 3 samples from the population at random. I’ve done this 10 times (so there are 10 line segments). You can see that hardly any of the sampling groups really replicate the population.

From a statistical standpoint, what I really want to see is that the average (mean) and median line up with each other, and that they also line up with the vertical line at 100 on the histogram. In other words, in an ideal sampling, I would have randomly grabbed 3 samples, where the median and the average line up on the vertical 100 line. That is not happening.

My averages are close to 100 in 3 out of the 10 sampling groups — that’s 30%. That’s not good at all. That means that in 70% of the cases we’re not close to the average.

In fact, in only one case does my sample group actually come close to looking like the spread in the data — 1 case out of 10!

In other words, in 90% of the cases, my random sampling doesn’t look anything like the population I was sampling from — that’s not good. In 30% of cases, the averages are kinda close to the population average.
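If you want to try this yourself, the exercise can be sketched in a few lines of Python. This is not the code behind the figures. It assumes a normally distributed population centered on 100 with a standard deviation of 15 (the spread is my assumption, since only the center is stated here), and it counts a sample mean as “close” when it lands within 5 units of 100, which is also an arbitrary cutoff chosen for illustration.

```python
import random
import statistics

POP_MEAN = 100   # the vertical line at 100 on the histogram
POP_SD = 15      # assumed spread; not taken from the article
TOLERANCE = 5    # assumed definition of "close to 100"

def draw_sample_means(n_per_group, n_groups=10, seed=1):
    """Draw n_groups random samples of size n_per_group; return each sample's mean."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_groups):
        sample = [rng.gauss(POP_MEAN, POP_SD) for _ in range(n_per_group)]
        means.append(statistics.mean(sample))
    return means

means = draw_sample_means(3)
close = sum(abs(m - POP_MEAN) <= TOLERANCE for m in means)
print("sample means:", [round(m, 1) for m in means])
print(f"{close} of {len(means)} sample means are within {TOLERANCE} of {POP_MEAN}")
```

Change the `seed` and the count will move around, which is itself the point: with 3 subjects per group, the answer depends heavily on which 3 you happened to draw.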

#### What About 5, 10, and 20 Samples Per Group?

So what is a decent number of samples per group? Is 5 good? What about 10 or 20? Well, I’m glad you asked, because here are some graphs.

Let’s start with 5 samples per group:

Again, in an ideal world we’d see all of the means lined up on the vertical 100 line — we don’t see that. The means are close in 3 of 10 cases — 30%. And only one of them really looks like the spread in the data. Clearly 5 isn’t cutting it at all.

What about 10 samples per group? Is that better?

10 certainly looks better. In this case we have 4 out of 10 (40%) that have means really close to the vertical 100 line. We also have more cases of samples looking more like the distribution — 3 (I’m including the second from the top — it barely makes my criteria). So at least that’s now 30%.

And how about 20 samples per group? Is that any better?

So now we have 6 out of 10 that have really close means to the vertical 100 line (that’s 60%!). The number that look like the distribution is now 7 — that’s 70% — not bad! A clear improvement over the sample size of 3 samples per group.
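Ten draws per group size is itself a small sample, so the exact percentages above will wobble from run to run. A rough way to see the underlying trend is to repeat the draw thousands of times for each group size. The sketch below does that under my own assumptions (a normal population with mean 100 and standard deviation 15, and “close” meaning within 5 units of 100), none of which come from the figures. It also prints the standard error of the mean, which is the population standard deviation divided by the square root of the group size.

```python
import math
import random

POP_MEAN = 100   # center of the population, as in the figures
POP_SD = 15      # assumed spread; not taken from the article
TOLERANCE = 5    # assumed definition of "close to 100"

def fraction_close(n_per_group, n_repeats=5000, seed=0):
    """Share of simulated groups whose mean lands within TOLERANCE of POP_MEAN."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_repeats):
        total = sum(rng.gauss(POP_MEAN, POP_SD) for _ in range(n_per_group))
        if abs(total / n_per_group - POP_MEAN) <= TOLERANCE:
            hits += 1
    return hits / n_repeats

results = {n: fraction_close(n) for n in (3, 5, 10, 20)}
for n, frac in results.items():
    se = POP_SD / math.sqrt(n)  # standard error of the mean for this group size
    print(f"n={n:>2}: standard error {se:.1f}, "
          f"share of means within +/-{TOLERANCE}: {frac:.0%}")
```

Because the standard error shrinks with the square root of the group size, going from 3 to 20 subjects cuts the typical miss by a bit more than half, not by a factor of nearly seven.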

#### Practically speaking, what does this mean?

So what does this mean, and how can you use this information?

It means, simply, that studies with a group size of 3 are not reliable. They are not reliably telling us anything about the population. More often than not, they don’t give us an accurate representation of the population.

In other words — be careful when interpreting results from a pharmacology/toxicology study with only 3 subjects (animals, humans, etc).

Always treat small pharmacology and toxicology studies very skeptically. As you can see, in this particular example, larger groups (10 and 20 animals/humans per group) are more reliable than groups of 3. This is why, historically, toxicity tests used for regulatory purposes typically required 10 to 20 animals per group. Today, for various reasons, groups are trying to get that number lower (a topic for another post).

#### What You Should Do

Next time you see something in the press that says “hey this chemical/drug is really bad for you” — stop. Find the study. Go to the Materials and Methods section. Find out how many animals they used per group. Then go out and see if there are any other similar studies. See if there are any larger studies. See how the results differ between the larger study and the smaller study. If there are lots of studies, consider what all of them are saying overall.

Is there a rule of thumb for a good sample size per group? The only real rule of thumb is that we need to seriously question studies with small sample sizes. We should be seriously suspicious of any study that uses 3 or fewer subjects per group. Beyond that, you really need to talk with an expert in biostatistics AND toxicology.

Why do I say biostatistics AND toxicology? Because you need a biostatistician who understands toxicology studies, has spent time in laboratories, and understands what is normal and what isn’t normal. A standard toxicologist typically lacks the biostatistics knowledge required to scrutinize the statistics, and may have been raised in a culture that accepts small sample sizes, or routinely performs incorrect statistical analyses. You need someone trained in both areas.

That’s where the experts at Raptor Pharm & Tox, Ltd can help.

#### Raptor Pharm & Tox, Ltd Has Toxicologists Cross-Trained in Biostatistics

Our experts in the scientific integrity investigation practice are cross-trained in toxicology and biostatistics. We can help you assess studies in the published literature and the gray literature to identify their strengths and weaknesses. Contact us to find out how we can help you.