
The Doorknob and the Door(s): Why Bayes Matters Now

By Paul Alper

When I was young, there was a sort of funny story about someone who invented the doorknob but died young and poor because the door had yet to be invented. Perhaps the imagery is backwards, in that the door existed but was useless until the doorknob came into being, but I will stick with the doorknob coming first in time. Bear with me as I attempt to show the relevance of this to the current meteoric rise of Bayesianism, a philosophy and concept several centuries old.

In a previous posting, “Statistical Illogic: the fallacy of Jacob Bernoulli and others,” I reviewed the book Bernoulli’s Fallacy by Aubrey Clayton. He shows in great detail how easy it is to confuse what we really should want

Prob(Hypothesis | Evidence)                                       Bayesianism

with

Prob(Evidence | Hypothesis)                                       Frequentism
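The two quantities are linked by Bayes Theorem itself, which in the notation above reads

Prob(Hypothesis | Evidence) = Prob(Evidence | Hypothesis) × Prob(Hypothesis) / Prob(Evidence)

where Prob(Hypothesis) is the prior, our belief before seeing the evidence, and Prob(Evidence) is the normalising constant that will reappear later in this piece.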

A classic instance of Bayesian revision in higher education would be the famous example at the Berkeley campus of the University of California. In the 1970s, it was alleged that there was discrimination against women applying to graduate school. Indeed, the overall male admission rate was higher than the overall female admission rate. But, according to https://www.refsmmat.com/posts/2016-05-08-simpsons-paradox-berkeley.html, the simple explanation

“is that women tended to apply to the departments that are the hardest to get into, and men tended to apply to departments that were easier to get into. (Humanities departments tended to have less research funding to support graduate students, while science and engineer departments were awash with money.) So women were rejected more than men. Presumably, the bias wasn’t at Berkeley but earlier in women’s education, when other biases led them to different fields of study than men.”
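To see how the aggregate numbers can mislead, consider a purely hypothetical two-department illustration (the figures are invented to show the arithmetic, not Berkeley’s actual data). Suppose Department A admits 70 of 100 women (70%) and 480 of 800 men (60%), while the much harder Department B admits 180 of 900 women (20%) and 20 of 200 men (10%). Women fare better than men in each department, yet overall women are admitted at 250/1,000 = 25% against 500/1,000 = 50% for men, simply because most women applied to the harder department.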

Clayton’s examples, such as the Prosecutor’s Fallacy and medical testing confusion, give no hint of how analytically difficult it was to perform the calculations of Bayes Theorem in complicated situations. Except for a paragraph or two on pages 297 and 298, he makes no reference to how and why Bayesian calculations can now be done numerically on very complicated, important, real-life problems in physics, statistics, machine learning, and many other fields, hence the proliferation of Bayesianism.

For the record, the one and only picture of the Reverend Thomas Bayes is generally considered apocryphal; ditto the one and only picture of Shakespeare. Bayes died in 1761 and his eponymous theorem was presented to The Royal Society in 1763 by Richard Price, an interesting character in his own right.

What has changed since the inception of Bayes Theorem more than two centuries ago, the doorknob if you will, is the advent of the door: World War II and the computer. At Los Alamos, New Mexico, the place that gave us the atom bomb, five people were confronted with a complicated problem in physics and came up with a numerical way of applying Bayes Theorem via an approach known as MCMC, which stands for Markov Chain Monte Carlo. Their particular numerical way of doing things is referred to as the “Metropolis Algorithm”, named after Nicholas Metropolis, the author whose name came first alphabetically.

To give the flavour but not the details of the Metropolis algorithm, I will use a well-done, simple example I found on the web, one which does not need the Metropolis algorithm but can be solved directly using Bayes Theorem; I then show an inferior numerical technique before indicating how Metropolis would do it. The simple illustrative example is taken from the excellent web video, Bayes theorem, the geometry of changing beliefs (which in turn draws on work by the psychologists Daniel Kahneman, later a Nobel prize winner, and Amos Tversky).

The example starts with ‘Steve’, a shy, retiring individual. We are asked to say which is more likely: that he is a librarian, or a farmer? Many people will say ‘librarian’, but that is to ignore how many librarians and how many farmers there are in the general population. The example suggests there are 20 times as many farmers as librarians, so we start with 10 librarians and 200 farmers and no one else. Consequently,

Prob(Librarian) = 10/(10 + 200) = 1/21

Prob(Farmer) = 200/(10 + 200) = 20/21

The video has 4 of the 10 librarians being shy and 20 of the 200 farmers being shy; a calculation shows how the evidence revises our thinking:

Prob(Librarian | shyness) = 4/(4 + 20) = 4/24 = 1/6

Prob(Farmer | shyness) = 20/(4 + 20) = 20/24 = 5/6

‘Steve’ is NOT more likely to be a librarian: the probability that ‘Steve’ is a librarian is actually one in six. Bayesian revision has been carried out, and note that the results are normalised. That is, the 5 to 1 ratio, 20/4, trivially leads to 5/6 and 1/6. Normalisation is important in order to calculate the mean, variance, and so on. In more complicated scenarios in many dimensions, normalisation remains vital but is difficult to obtain.
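To make the arithmetic concrete, here is a minimal sketch in Python (my own illustration, using the counts from the example above) that applies Bayes Theorem directly, with the normalisation step made explicit:

```python
# Exact Bayesian update for the librarian/farmer example:
# 10 librarians (4 of them shy) and 200 farmers (20 of them shy).

priors = {"librarian": 10 / 210, "farmer": 200 / 210}      # Prob(Hypothesis)
likelihoods = {"librarian": 4 / 10, "farmer": 20 / 200}    # Prob(shy | Hypothesis)

# Numerators of Bayes Theorem: Prob(shy | Hypothesis) * Prob(Hypothesis)
unnormalised = {h: likelihoods[h] * priors[h] for h in priors}

# Normalising constant: Prob(shy), summed over both hypotheses
evidence = sum(unnormalised.values())

posteriors = {h: unnormalised[h] / evidence for h in priors}
print(posteriors)   # {'librarian': 0.1666..., 'farmer': 0.8333...}
```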

The problem of normalisation can be solved numerically, though not yet in the Metropolis way. Picture a 24-sided die. At each roll of the die, record whether the side that comes up is 1, 2, 3 or 4; if so, call the toss Librarian. If any other number, 5 to 24, comes up, call it Farmer. Do this (very) many thousands of times and roughly 1/6 of those tosses will be Librarian and 5/6 will be Farmer. This sampling procedure is deemed independent in that a given toss of the die does not depend on the tosses that took place before. Unfortunately, this straightforward independent sampling procedure does not work well on more involved problems in higher dimensions.
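A minimal sketch of that independent-sampling procedure, again in Python (my own illustration, not taken from any of the sources above):

```python
import random

# Independent sampling with a 24-sided die: faces 1-4 stand for the
# shy librarians, faces 5-24 for the shy farmers.
tosses = 100_000
librarian_count = sum(1 for _ in range(tosses) if random.randint(1, 24) <= 4)

print(librarian_count / tosses)        # roughly 1/6, about 0.167
print(1 - librarian_count / tosses)    # roughly 5/6, about 0.833
```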

Metropolis uses a specific dependent sampling procedure, in which the choice of where to go next does depend on where you are now but not on how you got there, i.e. the previous places you visited play no role. Such a situation is called a Markov process, a concept which dates from the early 20th century. If we know how to transition from one state to another, we typically seek the long-run probability of being in each state. In the Librarian/Farmer problem, there are only two states, Librarian and Farmer. The Metropolis algorithm says: begin in one of the states, Librarian or Farmer, and toss a two-sided die which proposes a move. Accept this move as long as you do not go down. So moving from Librarian to Librarian, Farmer to Farmer, or Librarian to Farmer (up from a height of 4 to a height of 20) is accepted. Moving from Farmer to Librarian may be accepted or not; the choice depends on the relative heights: the bigger the drop, the less likely the move is to be accepted. Metropolis says: take the ratio, 4/20, and compare it to a random number between zero and one. If the random number is less than 4/20, move from Farmer to Librarian; if not, stay at Farmer. Repeat the procedure (very) many, many times.

Typically, there is a burn-in period, so the first batch of iterations is ignored; from then on we count the fraction of iterations spent in the Librarian state and in the Farmer state, which yields the 1/6 and 5/6.
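Here is a minimal sketch of that two-state Metropolis scheme in Python, my own illustration of the procedure just described; the number of iterations and the burn-in length are arbitrary choices:

```python
import random

# Unnormalised "heights": 4 shy librarians, 20 shy farmers.
heights = {"Librarian": 4, "Farmer": 20}
states = list(heights)

current = "Farmer"                      # arbitrary starting state
iterations, burn_in = 200_000, 10_000
visits = {"Librarian": 0, "Farmer": 0}

for i in range(iterations):
    proposal = random.choice(states)    # the "two-sided die"
    ratio = heights[proposal] / heights[current]
    # Always accept level or uphill moves; accept downhill moves
    # with probability equal to the ratio of the heights.
    if ratio >= 1 or random.random() < ratio:
        current = proposal
    if i >= burn_in:
        visits[current] += 1

kept = iterations - burn_in
print(visits["Librarian"] / kept)   # approaches 1/6
print(visits["Farmer"] / kept)      # approaches 5/6
```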

Multiple thousands of iterations today take no time at all; back in World War II, computing was in its infancy, and one wonders how many weeks it took to get a run which today would be done in seconds. But, so to speak, a door was being constructed.

In 1970, Hastings introduced an additional term so that, for complex cases, the proposals and acceptances would better capture more complicated, involved “terrain” than this simple example. In keeping with the doorknob and door imagery, Metropolis-Hastings is a better door, allowing us to visit more complicated, elaborate terrain more assuredly and more quickly. An even newer door, inspired by problems in physics, is known as Hamiltonian Monte Carlo. It is even more complicated, but it is still a door, related to the previous MCMC doors. There are many websites and videos attempting to explain the details of these algorithms, but following the logic of every step is not easy going. Suffice it to say, however, that the impact is enormous and justifies the resurgence of Bayesianism.
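For readers curious about what Hastings added: with a proposal distribution q, a proposed move is accepted with probability

min(1, [P(proposed) × q(current | proposed)] / [P(current) × q(proposed | current)])

The extra q terms are Hastings’ correction for asymmetric proposals; when q is symmetric they cancel, and the rule reduces to the plain Metropolis acceptance used in the Librarian/Farmer example.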

Paul Alper is an emeritus professor at the University of St. Thomas, having retired in 1998. For several decades, he regularly contributed Notes from North America to Higher Education Review. He is almost the exact age of Woody Allen and the Dalai Lama and thus, was fortunate to be too young for some wars and too old for other ones. In the 1990s, he was awarded a Nike sneaker endorsement which resulted in his paper, Imposing Views, Imposing Shoes: A Statistician as a Sole Model; it can be found at The American Statistician, August 1995, Vol 49, No. 3, pages 317 to 319.



Statistical illogic: the fallacy of Jacob Bernoulli and others

by Paul Alper

Bernoulli’s Fallacy: Statistical Illogic and the Crisis of Modern Science by Aubrey Clayton.

“My goal with this book is not to broker a peace treaty; my goal is to win the war.” (Preface, p. xv)

“We should no more be teaching p-values in statistics courses than we should be teaching phrenology in medical schools.” (p. 239)

It is possible or even probable that many a PhD thesis or journal article in the softer sciences has got by on a misunderstanding of probability and statistics. Clayton’s book aims to expose the shortcomings of a fallacy first attributed to the 17th-century mathematician Jacob Bernoulli but relied on repeatedly for centuries afterwards, despite the 18th-century work of the statistician Thomas Bayes, and exemplified in the work of R A Fisher, whose approach is the staple of so many social science primers on probability and statistics.

In the midst of the frightening Cold War, I attended a special lecture by Fisher, the most prominent statistician of the 20th century, at the University of Wisconsin-Madison on 12 February 1960; he was touring the United States and other countries. I had never heard of him and indeed, despite being in grad school, my undergraduate experience was entirely deterministic: apply a voltage, then measure a current; apply a force, then measure acceleration; and so on. There was not a hint, not a mention, of variability, noise, or random disturbance. The general public’s common currency in 1960 did not include such terms as random sample, statistical significance, and margin of error.

However, Fisher was speaking on the hot topic of that day: was smoking a cause of cancer? Younger readers may wonder how in the world this was a debatable subject when, in hindsight, it is so strikingly obvious. Well, it was not obvious in 1960, and the history of inflight smoking indicates how difficult it was to turn the tide, and how many years it took. Fisher’s tour of the United States was sponsored by the tobacco industry, but it would be wrong to conjecture that he was being hypocritical, and not just because he was a smoker himself.

Fisher believed that mere observations were insufficient for concluding that A causes B; it could be that B causes A or that C is responsible for both A and B. He insisted upon experimental and not merely observational evidence. According to Fisher, it could be that some underlying physical problem led people to smoke, rather than smoking causing the underlying problem, or that some other cause such as pollution was to blame. On his view, experimentally linking smoking to cancer would require that some children, chosen at random, be required to smoke and others required not to smoke, and then, as time went by, the incidence of cancer in each of the two groups would be noted.

However, according to Clayton, Fisher himself, just like Jacob Bernoulli, had it backwards when it came to analysing experiments. If Fisher and Bernoulli could make this mistake, it is easy for others to fall into the trap, because ordinary language keeps tripping us up. Clayton expends much effort on examples, such as the famous Prosecutor’s Fallacy. The fallacy was exemplified in the UK by the infamous Meadow case, which Clayton discusses at length: a prosecution expert witness made unsustainable assertions about the probability of innocence being “one in 73 million”.

The Bayesian way of looking at things is to consider the probability that a person is guilty, given the evidence. This is not the same as the probability of the evidence, given that the person is guilty, which is the ‘frequentist’ approach adopted by Fisher, and the two can be wildly different numerically. Another example, from the medical world: there is confusion between the probability of having a disease given a positive test, and the probability of a positive test given the disease:

                        Prob(Disease | Test Positive); the Bayesian way of looking at things

and

                        Prob(Test Positive | Disease); the frequentist approach

The patient is interested in the former but is often quoted the latter, known as the sensitivity of the test, and the two can be markedly different depending on the base rate of the disease. Suppose the base rate is, say, one in 1,000, the sensitivity is 90%, and the test also (say) gives a false positive for 10% of the people who do not have the disease. Then for every 1,000 people tested, roughly 100 will be false positives, while only about one person actually has the disease. A Bayesian would therefore conclude, correctly, that the chances of a false positive are about 100 times greater than the chances of actually having the disease. In other words, even after a positive test, the hypothesis that the person has the disease remains poorly supported by the evidence. However, a frequentist might mistakenly say that if you test positive there is a 90% chance that you have the disease.
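A minimal sketch of the Bayesian calculation in Python, using the illustrative numbers above (a one-in-1,000 base rate, 90% sensitivity and a 10% false-positive rate, figures chosen for illustration rather than taken from Clayton):

```python
# Bayes Theorem for the medical-testing example.
base_rate = 1 / 1000        # Prob(Disease)
sensitivity = 0.90          # Prob(Test Positive | Disease)
false_positive_rate = 0.10  # Prob(Test Positive | No Disease)

# Prob(Test Positive), summing over having and not having the disease
p_positive = sensitivity * base_rate + false_positive_rate * (1 - base_rate)

# Prob(Disease | Test Positive): what the patient actually wants to know
p_disease_given_positive = sensitivity * base_rate / p_positive
print(p_disease_given_positive)   # about 0.009, i.e. roughly 1%
```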

The quotation from page xv of Clayton’s preface, which begins this essay, shows how determined Clayton, a Bayesian, is to counter Bernoulli’s fallacy and set things straight. Fisher’s frequentist approach still finds favour among social scientists because his setup, no matter how flawed, was an easy recipe to follow. Assume a straw-man hypothesis such as ‘no effect’, take data to obtain a so-called p-value and, in the mechanical manner suggested by Fisher, if the p-value is low enough, reject the straw man. The winner is then the opposite of the straw man, namely the claim that the effect/hypothesis/contention is real.

Fisher, a founder and not just a follower of the eugenics movement, was, as I once wrote, “a genius, and difficult to get along with.” Upon reflection, I consequently changed the conjunction to an implication: “a genius, therefore difficult to get along with.” His then son-in-law back on 12 February 1960 was George Box, also a famous statistician and, among other things, the author of the famous phrase in statistics, “all models are wrong, but some are useful”, who had just been appointed head of the University of Wisconsin’s statistics department. Unlike Fisher, Box was a very agreeable and kindly person and, as evidence of those qualities, I note that he was on the committee that approved my PhD thesis, a writing endeavour of mine which I hope is never unearthed for future public consumption.

All of that was a long time ago, well before the Soviet Union collapsed, only to be followed by today’s military rise of Russia. Tobacco use and sales throughout the world are much reduced, while cannabis acceptance is on the rise. Statisticians have since moved on to consider and solve much weightier computational problems under the rubric of so-called Data Science. I was in my mid-twenties and I doubt that there were many people younger than I was at that Fisher presentation, so I am on track to be the last one alive who heard a lecture by Fisher disputing smoking as a cause of cancer. He died in Australia in 1962, a month after my 26th birthday, but his legacy, reputation and contribution live on, and hence the fallacy of Bernoulli as well.
