“Is it significant?” is the wrong question

Author: Frank Buckler, Ph.D.

Founder of CX-AI.com and CEO of Success Drivers
// Pioneering Causal AI for Insights since 2001 //
Author, Speaker, Father of two, a huge Metallica fan.

Published on: October 6, 2023 * 8 min read

The other day I noticed a post on LinkedIn by an esteemed acquaintance, Ed Rigdon, Professor at Georgia State University in Atlanta, USA. I read, “Even though significance tests date back only to the 1930s when Fisher promoted them, and even though they are logically flawed and clearly retard knowledge growth, social science researchers are enslaved by this very bad idea. Even now that the American Statistical Association has explicitly rejected the concept, researchers cling to it desperately.”


Wow. Tough stuff. Can that be true? In business, when you look at market research and analysis results, two questions often come up: is this representative, and is this significant? In other words, “is it true for everyone?” and “is the impact large?” Setting representativeness aside for today, it is common practice, both in science and in companies, to answer the second question with the P-value, also known as the significance value or probability of error. A P-value of 0.001 is considered really good, and in practice even 0.1 is sometimes accepted as sufficient. Significance is used as a judge between right and wrong.

I look at the official statement of the American Statistical Association on the P-value:

Principle 1: P-values can indicate how incompatible the data are with a specified statistical model.
Principle 2: P-values do not measure the probability that the hypothesis under study is true or the probability that the data were produced by chance alone.
Principle 3: Scientific conclusions and business or policy decisions should not be based solely on whether a P-value passes a certain threshold.
Principle 4: Proper conclusions require full reporting and transparency.
Principle 5: A P-value or statistical significance is not a measure of the size of an effect or the importance of a result.
Principle 6: A P-value alone is not a good measure of evidence for a model or hypothesis.

There now seems to be a consensus in the American Statistical Association that the common practice of using significance values is not very useful, if not dangerous. Why dangerous? Because it declares relationships to be “true” that, when viewed holistically, are not. Wrong decisions and low ROI are therefore preprogrammed in practice.


Significance can be created arbitrarily

Everyday language is a bad advisor. We all know sentences like “Mr. Y had a significant impact on X.” What is meant here is “large” or “important.” But if something is statistically significant, that does not mean it is important. On the contrary, something statistically significant can also be very small and unimportant. Statistically, “significant” only means that a relationship in the data is clear enough that it would be unlikely to arise by chance alone.

In market research parlance, significant is thus taken to mean “proven to be true.” But nothing is set in stone. In science, the term “P-hacking” has emerged: models, hypotheses and data are changed and trimmed until the P-value falls below the targeted threshold. If P-hacking is already commonplace in science, what does it look like in market research practice?
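How easily significance can be manufactured is simple to simulate. The following is a minimal sketch, not a description of any particular study: it correlates 200 purely random “drivers” with a purely random outcome and counts how many pass the usual 0.05 threshold. The function names, the number of candidates, the sample size and the Fisher z approximation used for the P-value are all assumptions made for illustration.

```python
import math
import random


def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided P-value for a correlation via the Fisher z approximation.

    This is an approximation chosen to keep the sketch dependency-free,
    not an exact test.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def count_chance_hits(n_candidates=200, n_obs=30, alpha=0.05, seed=1):
    """Correlate pure-noise 'drivers' with a pure-noise outcome; count P < alpha."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_obs)]
    hits = 0
    for _ in range(n_candidates):
        driver = [rng.gauss(0, 1) for _ in range(n_obs)]
        if approx_p_value(pearson_r(driver, outcome), n_obs) < alpha:
            hits += 1
    return hits


hits = count_chance_hits()
print(f"{hits} of 200 noise variables look 'significant' at P < 0.05")
```

At a 5% threshold, roughly one in twenty noise variables should pass by chance alone. Anyone who tries enough variables, segments or model variants and reports only the winners will therefore always find something “significant.”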

Significance has nothing to do with relevance

In practice, just about any nonzero correlation becomes significant if only the sample is large enough. Significance does not measure how strong a correlation is, but whether it can be assumed to be present at all.
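The sample-size effect can be made tangible with a short calculation. This is a rough sketch under stated assumptions: the correlation of 0.05 is a hypothetical value (it explains only 0.25% of the variance), the sample sizes are arbitrary, and the P-value is approximated with the Fisher z transformation rather than an exact test.

```python
import math


def approx_p_value(r, n):
    """Two-sided P-value for a correlation r at sample size n (Fisher z approximation)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


weak_r = 0.05  # hypothetical, practically negligible correlation

# Same tiny correlation, growing sample: only the P-value changes.
for n in (100, 1_000, 10_000, 100_000):
    p = approx_p_value(weak_r, n)
    print(f"n = {n:>7,}: r = {weak_r}, approx. P = {p:.2g}")
```

The strength of the relationship is identical in every row. Only the sample size decides whether it is labeled “significant.”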

A strong correlation usually needs a smaller sample to become significant. This often leads to the misunderstanding that significance also measures relevance. It does not. Every Swabian is a German, but not every German is a Swabian: strong effects tend to be significant, but a significant effect need not be strong. Even a minimal effect can be significant. All this is clear to many market researchers, though. The real problem lies elsewhere:


Significance in itself says nothing

Who doesn’t know them, the harebrained examples of spurious correlation, like the one between the age of each year’s Miss America and the number of murders by steam or other hot objects? At N=8 years, this statistic already has a P-value of 0.001.

Correlation is not causation

An example could hardly show more clearly how unsuitable the P-value is for testing whether a correlation is true. But why is it still used as a judge of right or wrong? There are many and varied answers.

Some will say “the customer wants it that way.” But the more relevant question is: What does doing it right look like? How can I find out if a context is to be trusted?

To answer that, let’s revisit what the P-value is used to assess: it is meant to judge either differences found or correlations found.

Questions about differences are, for example: “Do more customers buy product X or Y?” A survey yields two results, which are then compared. A comparison using a significance test is of limited value if:

Representativeness is limited.

If a smartphone brand surveys only young people, the result will not reflect that of the entire population because older people have different needs. If the sample does not represent the population well, the results will not be accurate. However, only those characteristics of the people that influence the measured result (see “context” below) are relevant. This is exactly the point that is usually overlooked. It is convenient to set quotas only by age and gender without checking whether these are the relevant drivers of representativeness.

The measurement is biased.

I can ask consumers, “Would you buy this cell phone?” But whether the answer is then true (i.e., unbiased) is another matter. A central focus of marketing research in recent years has been to develop valid scales, and implicit measurement methods have been added more recently. The art of questionnaire design belongs here as well.

Questions on correlations, on the other hand, are, for example: “Do customers in the target group buy my product more than other target groups?” Behind this is the assumption that the target group characteristic is causal for the purchase: a connection is assumed between consumer characteristics and willingness to buy. It is then no longer merely a matter of showing the difference between target groups, because that would be tantamount to a correlation analysis, which is known to be a poor way of analyzing causal relationships, as the Miss America example above shows. In my opinion, the question of such relationships is not discussed enough, although it is of very special importance.

The Evidence Score

The vague term “context” is about something very concrete: a causal relationship. All business decisions rest on assumptions about causal relationships: “If I do X, then Y will happen.” Discovering, exploring and validating these relationships is what most market research is about, consciously or unconsciously.

But whether we can trust a statement about a context is indicated by the product of the following three criteria:

Completeness (C for Complete): How many other possible reasons and conditions are there that also influence the target variable but have not yet been considered in the analysis? One can express this with a subjective probability (an a priori probability in the Bayesian sense): 0.8 for “pretty complete,” 0.5 for “the most important is in there,” or 0.2 for “actually most of it is missing.”

But why is completeness so important? Example: shoe size has some predictive power for career success because, for various reasons, men climb the career ladder higher on average and have larger feet. If one does not include gender in the analysis, there is a great risk of falling for spurious effects. Causal researchers call this “the confounder problem.” Confounders are unconsidered variables that influence cause and effect at the same time. Even today, most driver models are calculated with “only a handful” of variables, and the risk of spurious findings is therefore high.

The issue of representativeness logically belongs to completeness. One either ensures a representative sample (which is more or less impossible) and controls for biasing factors, or one measures the factors that influence the relationships under study (demographics, shopper types, etc.) and integrates them into the multivariable analysis of those relationships. I’ll go into this topic in detail in another post (tentative title: “AI saves representativeness”).

Correct Direction of Action (D for Directed correctly): How sure can we be that A is the cause of B and not vice versa? Here one can often fall back on prior knowledge, or one may have longitudinal data. Otherwise, causal discovery methods based on “d-separation” (e.g., the PC algorithm) have to be applied. Again, the question is what the subjective probability is: rather 0.9 for “well, that is well documented,” or 0.5 for “well, it could go either way”?

Predictive Power (P for Prognostic): How much variance in the affected variable does the cause explain? Effect-size measures quantify the absolute share of explained variance that is attributable to a variable. Nobel Prize winner Clive Granger once stated in his research: In a complete (C), properly directed (D) model, the explanatory power of a variable proves its direct causal influence.

If any of the three values C, D or P is lacking, the evidence for the relationship is very thin. This is because all three aspects are interdependent. Prognostic power without completeness or the right direction is worthless.
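The shoe-size example from the completeness criterion shows what “prognostic power without completeness” looks like in data. Here is a minimal simulation sketch, with all numbers invented for illustration: gender drives both shoe size and career level, while shoe size has no effect of its own on career. The variable names and effect sizes are assumptions, not empirical values.

```python
import math
import random


def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n=5000, seed=7):
    """Simulated people: gender is the confounder behind shoe size and career."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        male = rng.random() < 0.5
        shoe = 39 + (4 if male else 0) + rng.gauss(0, 1.5)  # gender -> shoe size
        career = (1.0 if male else 0.0) + rng.gauss(0, 1)   # gender -> career level
        rows.append((male, shoe, career))
    return rows


rows = simulate()

# Incomplete model: gender is ignored, shoe size seems to "predict" career.
overall = pearson_r([s for _, s, _ in rows], [c for _, _, c in rows])

# Complete model: within each gender, the relationship vanishes.
r_men = pearson_r(*zip(*[(s, c) for m, s, c in rows if m]))
r_women = pearson_r(*zip(*[(s, c) for m, s, c in rows if not m]))

print(f"all respondents: r = {overall:.2f}")
print(f"men only:        r = {r_men:.2f}")
print(f"women only:      r = {r_women:.2f}")
```

With thousands of respondents the overall correlation is highly significant, yet it says nothing about whether bigger shoes help a career. Only adding the confounder to the analysis reveals that.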

Mathematically, all three values can be multiplicatively combined. If one is small, the product is very small:

Evidence = C x D x P

This evidence score is a proven tool in Bayesian information theory, as well as a viable and useful value for judging a context.
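As a small worked sketch, the product can be written as a function. The name `evidence_score` and the input values are hypothetical: the C and D values echo the subjective probabilities mentioned above, while the value for P is simply assumed for the example.

```python
def evidence_score(completeness, direction, predictive_power):
    """Evidence = C x D x P, with each input expected on a 0..1 scale."""
    for name, value in (
        ("completeness", completeness),
        ("direction", direction),
        ("predictive_power", predictive_power),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return completeness * direction * predictive_power


# Hypothetical model A: fairly complete, well-documented direction.
model_a = evidence_score(completeness=0.8, direction=0.9, predictive_power=0.7)

# Hypothetical model B: same direction and predictive power, but most drivers missing.
model_b = evidence_score(completeness=0.2, direction=0.9, predictive_power=0.7)

print(f"model A: {model_a:.2f}, model B: {model_b:.2f}")
```

Because the values are multiplied rather than averaged, a single weak criterion pulls the whole score down, no matter how strong the other two are.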


Moving away from black-and-white thinking

If the evidence is high, then … yes, then one may ask again: what is the significance value of the correlation? But this is not about reaching a threshold, because the P-value cannot be the only criterion for whether one accepts a finding. As a continuous measure, the P-value is more informative and tells us how stable the statement we have in mind is. Nothing more. But no less, either.

Ed writes again...

I’m back on LinkedIn and re-reading a post by Ed Rigdon introducing his paper on the topic. He writes:

“How about just treating P-values as the continuous quantities they are? Don’t allow an arbitrary threshold to turn your P value into something else, and don’t misinterpret it. And while you’re at it, remember that your statistical analysis probably missed a lot of uncertainty, which means your confidence intervals are probably way too narrow.”

The next time a client asks me what the P-value is, I will tell them, “I don’t recommend looking at the P-value when it comes to trustworthiness. We measure the evidence score these days. Ours is 0.5, and the last model on which you based a decision at high significance scored only 0.2.”

LITERATURE

[1] “How improper dichotomization and the misrepresentation of uncertainty undermine social science research,” Edward E. Rigdon, Journal of Business Research, Volume 165, October 2023, 114086.

[2] “The American Statistical Association statement on P-values explained,” Lakshmi Narayana Yaddanapudi, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5187603/