How (Not) to Test for Algorithmic Bias (Guest Post)

8/22/2020

This is a guest post by Brian Hedden (University of Sydney).
(3000 words; 14 minute read)

Predictive and decision-making algorithms are playing an increasingly prominent role in our lives. They help determine what ads we see on social media, where police are deployed, who will be given a loan or a job, and whether someone will be released on bail or granted parole. Part of this is due to the recent rise of machine learning. But some algorithms are relatively simple and don’t involve any AI or ‘deep learning.’

As algorithms enter into more and more spheres of our lives, scholars and activists have become increasingly interested in whether they might be biased in problematic ways. The algorithms behind some facial recognition software are less accurate for women and African Americans. Women are less likely than men to be shown an ad relating to high-paying jobs on Google. Google Translate translated neutral non-English pronouns into masculine English pronouns in sentences about stereotypically male professions (e.g., ‘he is a doctor’).

When Alexandria Ocasio-Cortez noted the possibility of algorithms being biased (e.g., in virtue of encoding biases found in their programmers, or the data on which they are trained), Ryan Saavedra, a writer for the conservative Daily Wire, mocked her on Twitter, writing “Socialist Rep. Alexandria Ocasio-Cortez claims that algorithms, which are driven by math, are racist.”

I think AOC was clearly right and Saavedra clearly wrong. It’s true that algorithms do not have inner feelings of prejudice, but that doesn’t mean they cannot be racist or biased in other ways.

But in any particular case, it’s tricky to determine whether a given algorithm is in fact biased or unfair. This is largely due to the lack of agreed-upon criteria of algorithmic fairness.

This lack of consensus can be usefully illustrated by the controversy over the COMPAS algorithm used to predict recidivism. (It’s so famous that the Princeton computer scientist Arvind Narayanan jokes that it’s mandatory to mention COMPAS in any discussion of algorithmic fairness!)

In a major report for ProPublica, researchers concluded that COMPAS is ‘biased against blacks,’ to quote the headline of their article. They reached this conclusion on the grounds that COMPAS yielded a higher false positive rate (non-recidivists incorrectly labelled high-risk) for black people than for white people, and a higher false negative rate (recidivists incorrectly labelled low-risk) for white people than for black people.

Northpointe, the company behind COMPAS, responded to ProPublica’s charge, noting that COMPAS was equally accurate for black and white people, in the sense that their risk scales had equal AUCs (areas under the ROC curve). (Roughly, the AUC, applied to the case at hand, represents the probability that a random recidivist will be ranked as higher risk than a random non-recidivist. I won’t get into the technical details of this concept, but see here for some background.) And Flores, Bechtel, and Lowenkamp defended COMPAS on the grounds that, for each possible risk score, the percentage of those assigned that risk score who went on to recidivate was roughly the same for black and for white people.
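
To make the AUC gloss concrete, here is a minimal Python sketch that computes it directly as a pairwise ranking probability: the share of (recidivist, non-recidivist) pairs in which the recidivist gets the higher score. The scores are invented for illustration and are not COMPAS data, and the function name `pairwise_auc` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def pairwise_auc(pos_scores, neg_scores):
    """Share of (positive, negative) pairs in which the positive case
    gets the higher risk score; ties count as half a win."""
    wins = Fraction(0)
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            wins += 1
        elif p == n:
            wins += Fraction(1, 2)
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical risk scores (illustration only, not COMPAS data).
recidivists = [Fraction(9, 10), Fraction(7, 10), Fraction(6, 10)]
non_recidivists = [Fraction(8, 10), Fraction(4, 10), Fraction(2, 10)]

auc = pairwise_auc(recidivists, non_recidivists)
print(auc)  # 7/9
```

An AUC of 1 would mean every recidivist outranks every non-recidivist; 1/2 is what random scoring would achieve on average.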

It seems that ProPublica was tying fairness to one set of criteria, while Northpointe and Flores et al. were tying fairness to a different set of criteria. How should we decide which side was right? How should we decide whether COMPAS was really unfair or biased against black people? More generally, how should we decide whether an algorithm is unfair or biased?

Before jumping into this discussion, it’s worth pointing out that the debate over algorithmic fairness also bears on the fairness of human predictions and decisions. We can, after all, think of human prediction and decision-making as based on an underlying algorithm. And some possible criteria for what it takes for an algorithm to be fair, including those we’ll focus on below, can be applied to any set of predictions or decisions whatsoever, regardless of the nature of the underlying mechanism that produces them.

2400 words left

Statistical Criteria of Fairness
Let’s focus on algorithms like COMPAS. These algorithms make predictions, not decisions, though of course their predictions might be used to feed into decisions about bail, parole, and the like. The algorithms in question take as input a ‘feature vector’ (a set of known characteristics) and output either a risk score, or a binary prediction, or both. For simplicity, let’s focus on algorithms that output both a real-valued risk score between 0 and 1 and a binary (yes/no) prediction. We can think of the risk score as a probability that the individual will fall into the ‘positive’ class, and the prediction as akin to a binary belief about whether the individual will be positive or negative.

What criteria must a predictive algorithm satisfy in order to qualify as fair and unbiased? Some criteria concern the inner workings of the algorithm. Perhaps a fair algorithm must not use group membership as part of the feature vector upon which its predictions are based. It must be blinded to whether the individual is male or female, black or white, and so on. Perhaps fairness also requires that the algorithm be blinded to any ‘proxies’ for group membership. For instance, we might regard ZIP code as a proxy for race, given that housing in the US is highly segregated. It is a difficult matter to say in general when some feature counts as a proxy in a problematic sense, but the basic idea is clear enough.

Fairness also presumably requires that the algorithm use the same threshold in moving from a risk score to a binary prediction. It would be unfair, for instance, if black people assigned a risk score above 0.7 were predicted to recidivate, while only white people assigned a risk score above 0.8 were predicted to recidivate. These criteria are relatively uncontroversial and relatively easy to satisfy (except for the tricky issue of proxies for group membership). But are there any other criteria that an algorithm must satisfy in order to be fair and unbiased?

This post will be concerned with a class of purported fairness criteria that require that certain relations between predictions and actuality be the same across the relevant groups. I’ll call these ‘statistical criteria of fairness.’ These are the sorts of criteria that are at issue in the debate over COMPAS. They are of interest in part because we can determine whether they are satisfied by some algorithm just by looking at what it predicted and what actually happened. We don’t need to look at the inner workings of the algorithm, which may be proprietary or otherwise opaque. (This opacity is itself a problem, and we should seek as much transparency as possible going forward.)

Here are the main statistical criteria of fairness at issue in the debate over COMPAS. See the Appendix for several more that have been considered and discussed in the literature.

(1) Calibration Within Groups: For each possible risk score, the percentage of individuals assigned that risk score who are actually positive is the same for each relevant group and equal to that risk score.
(2) Equal False Positive Rates: The percentage of actually negative individuals who are falsely predicted to be positive is the same for each relevant group.
(3) Equal False Negative Rates: The percentage of actually positive individuals who are falsely predicted to be negative is the same for each relevant group.
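
Because these criteria depend only on predictions and outcomes, they can be checked mechanically. The following is a minimal Python sketch under stated assumptions: the record layout, the function names, and the toy data are all hypothetical, and each group is assumed to contain at least one actual positive and one actual negative (otherwise the rates are undefined).

```python
from collections import defaultdict
from fractions import Fraction

# Each record is (group, risk_score, predicted_positive, actually_positive).

def false_positive_rate(records):
    negatives = [r for r in records if not r[3]]
    return Fraction(sum(r[2] for r in negatives), len(negatives))

def false_negative_rate(records):
    positives = [r for r in records if r[3]]
    return Fraction(sum(not r[2] for r in positives), len(positives))

def calibrated(records):
    """For every risk score, the share of actual positives among those
    assigned that score equals the score."""
    by_score = defaultdict(list)
    for _, score, _, actual in records:
        by_score[score].append(actual)
    return all(Fraction(sum(v), len(v)) == s for s, v in by_score.items())

def check_criteria(records):
    groups = defaultdict(list)
    for r in records:
        groups[r[0]].append(r)
    parts = list(groups.values())
    return {
        "(1)": all(calibrated(g) for g in parts),
        "(2)": len({false_positive_rate(g) for g in parts}) == 1,
        "(3)": len({false_negative_rate(g) for g in parts}) == 1,
    }

def cell(group, score, n, n_positive, threshold=Fraction(1, 2)):
    """n people sharing one score, of whom n_positive are actually positive."""
    return [(group, score, score > threshold, i < n_positive) for i in range(n)]

# Hypothetical toy data, invented for illustration.
toy = (cell("G", Fraction(3, 4), 4, 3) + cell("G", Fraction(1, 4), 4, 1)
       + cell("H", Fraction(3, 4), 8, 6) + cell("H", Fraction(1, 4), 8, 2))
print(check_criteria(toy))
```

In this toy population both groups have a false positive rate and a false negative rate of 1/4 and every score is calibrated, so all three checks pass. Exact fractions are used so that equality tests are not upset by floating-point rounding.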

It’s pretty easy to see why each seems like an attractive criterion of fairness. If an algorithm violates (1) Calibration Within Groups, then it would seem that a given risk score has different evidentiary value for members of different groups. A given risk score would ‘mean’ different things for different individuals, depending on which group they are members of. If an algorithm violates (2), yielding a higher false positive rate for one group than for another, it’s tempting to conclude that it was being more ‘risky,’ or was jumping to conclusions more quickly, with respect to one group versus another. The same goes if it violates (3), yielding a higher false negative rate for one group than for another. And this seems unfair. It seems to conflict with the idea that individuals should be treated the same by the algorithm, regardless of their group membership.

1700 words left

Impossibility Theorems
It would be nice if some algorithms could satisfy all of these criteria. This wouldn’t mean that the algorithm is in fact fair. Even if each of these statistical criteria is necessary for fairness, they are not jointly sufficient – we saw above that there are additional, non-statistical criteria that must be satisfied as well. But still, it would be a promising start if an algorithm could satisfy all of these statistical criteria.

But it is impossible for an algorithm to satisfy all of these criteria, except in marginal cases. This is the upshot of a series of impossibility theorems. Two such theorems are particularly important. Kleinberg et al. prove that no algorithm can jointly satisfy (1) and close relatives of (2) and (3) unless either (i) base rates (i.e. the percentage of individuals who are in fact positive) are equal across the relevant groups, or (ii) prediction is perfect (i.e. the algorithm assigns risk score 1 to all positive individuals and 0 to all negative individuals). Chouldechova proves that no algorithm can jointly satisfy (2) and (3) and a close relative of (1), again unless base rates are equal or prediction is perfect.

I won’t go through the proofs of these impossibility theorems, but they’re not terribly technical. And here’s a great explanation of the theorems and their importance.
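
For readers who want the gist of the second theorem, it can be read off a simple identity. The notation here is an assumption for illustration: p is a group's base rate, FPR and FNR are its false positive and false negative rates, and PPV is its positive predictive value (the share of positive predictions that are correct; see (6) in the Appendix). Writing N for the group's size and TP, FP for its counts of true and false positives:

```latex
\begin{align*}
\mathrm{TP} &= (1-\mathrm{FNR})\,p\,N, \qquad \mathrm{FP} = \mathrm{FPR}\,(1-p)\,N,\\
\mathrm{PPV} &= \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}
  \;\Longrightarrow\; \mathrm{FP} = \mathrm{TP}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}},\\
\mathrm{FPR} &= \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot\bigl(1-\mathrm{FNR}\bigr).
\end{align*}
```

If two groups agree on PPV, FPR, and FNR, the last line forces p/(1-p), and hence the base rate p, to agree as well, except in degenerate cases such as perfect prediction.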

What should we make of these theorems? Pessimistically, we might conclude that fairness dilemmas are all but inevitable; outside of marginal cases, we cannot help but be unfair or biased in some respect.

More optimistically, we might conclude that some of our criteria are not in fact necessary conditions on algorithmic fairness, and that we need to take a second look and sort the genuine fairness conditions from the specious ones. This is the tack I will take.

1400 words left

People, Coins, and Rooms
How could we go about determining which (if any) of the above statistical criteria are genuinely necessary for fairness? One methodology would be to go one-by-one, looking at the motivations behind each criterion, and seeing if those motivations stand up to scrutiny.

Another methodology would be to find a perfect, 100% fair algorithm and see if that algorithm can violate any of those criteria. If it can, then this means that the criterion isn’t necessary for fairness. (If it can’t, this doesn’t mean that the criterion is necessary for fairness; perhaps some other 100% fair algorithm can violate it.) But this methodology may seem unpromising. It would be hard, if not impossible, to find any predictive algorithm that everyone agrees is perfectly, 100% fair.

At least, that is the case if we consider algorithms that predict important, messy things like recidivism and professional success. But we can do better by considering coin flips; this will enable us to make use of this second methodology.

Here is the setup: There are a bunch of people and a bunch of coins of varying biases (i.e. varying chances of landing heads). The people are randomly assigned to one of two rooms, A and B. And the coins are randomly assigned to people. So each person has one coin and is in one of our rooms. Our aim is to predict, for each person, whether her coin will land heads or tails. That is, we are trying to predict, for each person, whether they are a heads person or a tails person. Luckily, each coin comes helpfully labelled with its bias.

Here is a perfectly, 100% fair algorithm: For each person, take their coin and read its labelled bias. If the coin reads ‘x’, assign it a risk score of x. If x>0.5, make the binary prediction that it will land heads. If x<0.5, make the binary prediction that it will land tails. (I assume, for simplicity, that none of the coins are labelled ‘0.5.’)
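
The algorithm is simple enough to write down in a few lines. This is a minimal Python sketch; the function name `predict` and the use of exact fractions for the labels are illustrative choices, not part of the setup.

```python
from fractions import Fraction

def predict(labelled_bias):
    """Return (risk_score, prediction) from the coin's label alone.
    Room membership is never consulted."""
    if labelled_bias == Fraction(1, 2):
        raise ValueError("the setup assumes no coin is labelled 0.5")
    risk_score = labelled_bias
    prediction = "heads" if labelled_bias > Fraction(1, 2) else "tails"
    return risk_score, prediction

print(predict(Fraction(3, 4)))  # (Fraction(3, 4), 'heads')
print(predict(Fraction(1, 8)))  # (Fraction(1, 8), 'tails')
```

Note that the function takes no argument for the person's room, so it cannot treat anyone differently on that basis.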

It should be clear that this algorithm is perfectly, 100% fair. This, I think, is bedrock. It’s certainly an odd setup, and there’s some unfortunate randomness, but there’s no unfairness anywhere in the setup––and in particular not in our algorithm itself.

Let’s see how our criteria shake out with respect to this algorithm. A first thing to note is that criteria (1)-(3) were formulated in terms of what outcomes actually result. But with coin flips, anything can happen. No matter how biased a coin is in favour of heads (short of having heads on both sides), it can still land tails, and vice versa. So it’s actually quite easy for our algorithm to violate all of (1)-(3), given the right assignment of coins to people and people to rooms.

This suggests that we should have formulated our criteria in expectational or probabilistic terms:

(1*) Expectational Calibration Within Groups: For each possible risk score, the expected percentage of individuals assigned that risk score who are actually positive is the same for each relevant group and equal to that risk score.
(2*) Expectational Equal False Positive Rates: The expected percentage of actually negative individuals who are falsely predicted to be positive is the same for each relevant group.
(3*) Expectational Equal False Negative Rates: The expected percentage of actually positive individuals who are falsely predicted to be negative is the same for each relevant group.

(There’s a tricky question about how to understand the probability function relative to which these expectations are determined. I’ll think of it as an evidential probability function which represents what a rational person who knew about the workings of the algorithm would expect.)

We can investigate whether our perfectly, 100% fair algorithm can violate any of these starred criteria by considering a case where coin biases match relative frequencies (i.e. exactly 75% of the coins labelled 0.75 land heads, and so on). If our algorithm violates one of the unstarred criteria in this case, then it also violates the starred version of that criterion.

It turns out that our perfectly, 100% fair algorithm must satisfy (1*), but it can violate (2*) and (3*), given the right assignment of coins to people and people to rooms. Moreover, it can violate them simultaneously. And surprisingly, it can violate them simultaneously even when base rates are equal across the two rooms.

The following case illustrates this: Room A contains 12 people with coins labelled ‘0.75’ and 8 people with coins labelled ‘0.125.’ The former are all assigned risk score 0.75 and predicted to be heads people (positive), and nine of them are in fact heads people. The latter are all assigned risk score 0.125 and predicted to be tails people (negative), and seven of them are in fact tails people. Room B contains 10 people with coins labelled ‘0.6’ and 10 people with coins labelled ‘0.4.’ The former are all assigned risk score 0.6 and predicted to be heads people, and six of them are in fact heads people. The latter are all assigned risk score 0.4 and predicted to be tails people, and six of them are in fact tails people.

Note that base rates are equal across the two rooms: exactly ten out of the twenty people in each room are heads people.

While our algorithm in this case satisfies (1) Calibration Within Groups, and hence also (1*), it violates (2) and (3), and hence also (2*) and (3*). For Room A, the False Positive Rate is 3/10, while for Room B it is 4/10. And for Room A, the False Negative Rate is 1/10, while for Room B it is 4/10. This fair algorithm also violates a host of other statistical criteria of fairness that have been suggested in the literature – see the Appendix for details.
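
These figures can be rechecked from the counts in the example. The following Python sketch tallies true and false positives and negatives per room; the data layout and the name `summarise` are hypothetical.

```python
from fractions import Fraction

# (labelled bias, number of people, number who are in fact heads people)
rooms = {
    "A": [(Fraction(3, 4), 12, 9), (Fraction(1, 8), 8, 1)],
    "B": [(Fraction(3, 5), 10, 6), (Fraction(2, 5), 10, 4)],
}

def summarise(cells):
    tp = fp = fn = tn = 0
    calibrated = True
    for bias, n, heads in cells:
        # Calibration: the share of heads people at each score equals the score.
        calibrated = calibrated and Fraction(heads, n) == bias
        if bias > Fraction(1, 2):   # predicted heads (positive)
            tp, fp = tp + heads, fp + (n - heads)
        else:                       # predicted tails (negative)
            fn, tn = fn + heads, tn + (n - heads)
    return {
        "base rate": Fraction(tp + fn, tp + fp + fn + tn),
        "FPR": Fraction(fp, fp + tn),
        "FNR": Fraction(fn, fn + tp),
        "calibrated": calibrated,
    }

results = {room: summarise(cells) for room, cells in rooms.items()}
for room, r in results.items():
    print(room, {k: str(v) for k, v in r.items()})
```

Both rooms come out calibrated with a base rate of 1/2, while the error rates differ: 3/10 and 1/10 in Room A against 4/10 and 4/10 in Room B (printed in lowest terms as 2/5).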

This means that it is possible for a perfectly, 100% fair algorithm to violate (2*) and (3*) when given the right population as input. This suffices to show that neither is a necessary condition on fairness. It also suffices to show that none of the other criteria considered in the Appendix are necessary for fairness, either. Only (1*) Expectational Calibration Within Groups is left standing as a plausible necessary condition on fairness.

300 words left

Upshots
I think (1*) Expectational Calibration Within Groups is plausibly necessary for fairness. I also think fairness might require that the ‘inner workings’ of the algorithm be a certain way, for instance that the algorithm be blinded to group membership and that it use the same threshold in going from a risk score to a binary prediction. There may be other necessary conditions as well.

But it is misguided to focus on any of the other statistical criteria of fairness considered here or in the Appendix. Those criteria are tempting due to the relative ease of checking whether they are satisfied. But we have seen that an algorithm’s violating those criteria doesn’t mean that the algorithm is in any way unfair.

Now, even a perfectly fair predictive algorithm might have troubling results when used to make decisions in a certain way. It might have a disparate impact on the relevant groups. But the right way to respond to this disparate impact will often be not to modify the predictive algorithm, but rather to modify the way decisions are made on the basis of its predictions, or to intervene in society in other ways, for instance through reparations, changes in the tax code, and so on.

Of course, some of these responses might be politically infeasible. This is especially true for some of the policies that might be most effective in redressing racial and other injustices. Reparations would be a case in point. It is difficult to imagine reparations becoming policy, despite Ta-Nehisi Coates’ influential recent defense. If we can’t deal with racial (or other) injustices in these other ways, perhaps the best response is to chip away at injustice by modifying what was already a fair predictive algorithm. It’s not the ideal solution, but it might be second best. Still, it is important to be clear that an algorithm’s violating Equal False Positive/Negative Rates (or any of the other statistical criteria considered in the Appendix) neither entails nor constitutes the algorithm’s unfairness.

What next?
For an accessible explanation of the impossibility theorems, see this piece on Phenomenal World by Cosmo Grant.
For the back-and-forth about COMPAS, see the initial ProPublica report, Northpointe's response, and ProPublica's counter-response.
For a more general discussion of the issues surrounding algorithmic fairness, see this article by Barocas and Selbst.

Appendix
In this Appendix, I’ll briefly mention some of the other main statistical criteria of fairness that have been considered in the literature:

(4) Balance for the Positive Class: The average risk score assigned to those individuals who are actually positive is the same for each relevant group.
(5) Balance for the Negative Class: The average risk score assigned to those individuals who are actually negative is the same for each relevant group.
(6) Equal Positive Predictive Value: The percentage of individuals predicted to be positive who are actually positive is the same for each relevant group.
(7) Equal Negative Predictive Value: The percentage of individuals predicted to be negative who are actually negative is the same for each relevant group.
(8) Equal Ratios of False Positive Rate to False Negative Rate: The ratio of the false positive rate to the false negative rate is the same for each relevant group.
(9) Equal Overall Error Rates: The number of false positives and false negatives, divided by the number of individuals, is the same for each relevant group.

The first two – (4) and (5) – can be seen as generalizations of (2) and (3) to the case of continuous risk scores. Indeed, Pleiss et al. refer to the measures involved in (4) and (5) as the ‘generalized false negative rate’ and the ‘generalized false positive rate,’ respectively. And, along with (1), it is these two criteria, rather than (2) and (3), that are the target of the aforementioned impossibility theorem from Kleinberg et al.

The next two – (6) and (7) – can be seen as generalizations of (1) to the case of binary predictions. This is how Chouldechova conceives of them. Just as (1) is motivated by the thought that a given risk score should ‘mean’ the same thing for each group, so (6) and (7) can be motivated by the thought that a given binary prediction should ‘mean’ the same thing for each group. Chouldechova’s impossibility result targets (6) rather than (1), showing that (2), (3), and (6) are not jointly satisfiable unless either base rates are equal or prediction is perfect.

The final two – (8) and (9) – are also intuitive. For (8), it would seem that violating this criterion would mean that the relative importance of avoiding false positives versus avoiding false negatives was evaluated differently for the different groups. Finally, (9) Equal Overall Error Rates embodies the natural thought that fairness requires that the algorithm be equally accurate overall for each of the different groups.

We can now easily see that our perfectly fair predictive algorithm violates all these criteria as well (and hence also their expectational or probabilistic analogues), given the same assignment of coins to people and people to rooms as above:
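
The comparison can be recomputed from the counts in the example. The Python sketch below derives the measures behind (4)-(9) for each room and prints them side by side; the data layout and the name `appendix_measures` are hypothetical.

```python
from fractions import Fraction

# (risk score, number of people, number who are in fact heads people)
rooms = {
    "A": [(Fraction(3, 4), 12, 9), (Fraction(1, 8), 8, 1)],
    "B": [(Fraction(3, 5), 10, 6), (Fraction(2, 5), 10, 4)],
}

def appendix_measures(cells):
    tp = fp = fn = tn = 0
    pos_score_sum = neg_score_sum = Fraction(0)
    for score, n, heads in cells:
        pos_score_sum += score * heads        # scores of actual positives
        neg_score_sum += score * (n - heads)  # scores of actual negatives
        if score > Fraction(1, 2):            # predicted heads (positive)
            tp, fp = tp + heads, fp + (n - heads)
        else:                                 # predicted tails (negative)
            fn, tn = fn + heads, tn + (n - heads)
    fpr, fnr = Fraction(fp, fp + tn), Fraction(fn, fn + tp)
    return {
        "(4) avg score, actual positives": pos_score_sum / (tp + fn),
        "(5) avg score, actual negatives": neg_score_sum / (fp + tn),
        "(6) positive predictive value": Fraction(tp, tp + fp),
        "(7) negative predictive value": Fraction(tn, tn + fn),
        "(8) FPR / FNR": fpr / fnr,
        "(9) overall error rate": Fraction(fp + fn, tp + fp + fn + tn),
    }

table = {room: appendix_measures(cells) for room, cells in rooms.items()}
print(f"{'Measure':34}{'Room A':>8}{'Room B':>8}")
for name in table["A"]:
    print(f"{name:34}{str(table['A'][name]):>8}{str(table['B'][name]):>8}")
```

With these counts, Room A comes out at 11/16, 5/16, 3/4, 7/8, 3, and 1/5 for (4)-(9) respectively, against 13/25, 12/25, 3/5, 3/5, 1, and 2/5 for Room B.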

The fact that the numbers in the Room A column and the Room B column differ for each row means that our fair predictive algorithm violated all of (4)-(9), in addition to (2) and (3). This suffices to show that none of (4)-(9), nor their expectational/probabilistic analogues, is a necessary condition on fairness. Among all these statistical criteria, only (1*) Expectational Calibration Within Groups is left standing as a plausible necessary condition on fairness.

8 Comments

Milo Phillips-Brown

8/23/2020 12:50:14 pm

Hi! Glad to see more philosophers engaging with this question/literature. Two quick thoughts:

1) The point about different sorts of policy interventions is really important to make! There's a big literature on what this might look like in the context of algorithmic fairness in general and criminal justice in particular -- see e.g. https://arxiv.org/abs/1712.08238.

2) Might be good to note that the (very influential and very good) paper by Barocas and Selbst you note for "what's next" isn't about *formal definitions of fairness*, but rather about different sorts of things that can be called algorithmic bias.

Brian Hedden

8/24/2020 05:45:04 am

Thanks Milo! I like your recommendation of the Barabas et al. paper. And quite right about the broader scope of Barocas and Selbst. It's a fantastic article for getting a big-picture lay of the land.

Tom Stafford

9/9/2020 01:48:35 am

Hello

I wonder if I could push a bit on the fairness of your toy example. This doesn't seem to me to be obviously the "bedrock" which you claim, or at least I would like to hear some discussion of what makes something fair. For example, doesn't it matter what the costs and benefits are of being incorrectly assigned? If being assigned heads is greatly rewarded then we might have an intuition that more injustice is done to a false negative than a false positive (we might not, but it could be discussed).

Secondly, I do like the toy example, and think it helps clarify other things. The setup made me think that sometimes an algorithm makes manifest an unfairness that is latent in the environment (like the distribution of probabilities in the two rooms). Perhaps some complaints about unfair algorithms stem from an intuition that latent unfairnesses are somehow endorsed or confirmed when incorporated into an algorithm. And perhaps there is another intuition that algorithms should *correct for*, rather than merely pass on without augmenting or correcting, latent environmental biases.

Brian Hedden

9/9/2020 06:17:16 pm

Hi Tom,

Great comments. Let me start with the first. I do think that our intuitions about the fairness of my algorithm might be less clear if it is used to make decisions rather than just predictions. I think it's important to sharply distinguish between predictions and decisions (between the epistemology and the decision theory), and between predictive bias/unfairness and bias/unfairness in decisions. As I formulated it, my algorithm is just a predictive algorithm, divorced from any decisions about rewards and punishments. Now, you might think that makes things trivial, for there trivially cannot be any unfairness if we're just making predictions and not any further decisions. But I'm not sure about that. It seems plausible that an algorithm for just making predictions could be unfair or biased, and that's also supported by some recent literature on epistemic injustice and moral encroachment.

But suppose that the algorithm were used to make decisions as well, with (let's say) false positives being more costly to the individual than false negatives, and that both are more costly than true positives or true negatives. I would then agree that the algorithm's use is worse overall for Room B people than for Room A people. But this doesn't mean that it's unfair to any individuals *in virtue of their room membership*. This is because the individuals were randomly assigned to one room or another, and because the algorithm is blind to room membership and to any proxies thereof.

One might say that while the algorithm (again, now used to make decisions and not just predictions) isn't unfair to any individuals in virtue of their room membership, it's unfair to the group consisting of Room B people as a whole. That is, it yields unfairness to groups, not unfairness to individuals. This might be, but group-level unfairness seems like a tricky issue. The algorithm might yield overall worse outcomes for Room B people than for Room A people, but I'd resist the claim that yielding worse outcomes for one group than another means that it's unfair to the one group. But I agree, at least, that we're no longer in the realm of 'bedrock' principles!

For your second comment, I wholeheartedly agree. I think that the toy example helps clarify the way in which an algorithm can make manifest an underlying unfairness without itself being unfair. (I don't think that there's underlying unfairness in my example, but there would be in more real-world cases.)

As for the intuition that algorithms should correct for, rather than merely pass on, latent environmental biases, I think that depends on what options we're considering. Toward the end of my post, I suggested that it might be better not to make the predictive algorithm itself correct for these latent biases, but rather to intervene in other ways - either by directly correcting for the latent environmental biases (think of changes in the tax code, or reparations, or what have you) or by changing how decisions are made on the basis of predictions (for instance by using different risk thresholds for different groups in deciding how to act - but not in deciding what to predict). But I agree that these other interventions won't always be possible. It might be that, say, income tax decisions are made by one party while the design of a predictive algorithm is made by another party. In that case, the algorithm designers, knowing that the income tax or other optimal interventions - those which could correct directly for the underlying unfairness/bias - won't be made, might have good reason to try to get their algorithm to do much of the work of correcting underlying unfairness/bias by itself. I don't think that this would be ideal, optimal overall policy. But it might be the best that we can achieve, holding fixed various political constraints.

Thanks again for the comment!

Tom Stafford

9/10/2020 01:27:39 am

Thanks Brian
I hadn't appreciated the distinction you draw between a prediction algorithm, as in your example, and a decision making algorithm. In my defence, I think it will be a common intuition to use predictions as decisions (i.e. for algorithm consumers to fall for the naturalistic fallacy), but just because that mistake may happen doesn't mean we should make it too, of course.

Brian Hedden

9/10/2020 05:33:02 am

Hi Tom,

I agree with that. My sense of the literature is that in discussing, say, false positive rate equality, some people really are talking about predictions, while others are talking about naturally corresponding decisions. In the ProPublica article on COMPAS, the notion of a false positive is understood as just a matter of prediction. But other authors talk of a decision to deny someone a loan, say, as a prediction that they would default.

So one could defend statistical criteria relating certain decisions to actual outcomes. For instance, one could defend a modified version of equal false positive rates that says that the percentage of people who received the favorable action (granted parole, given a loan, etc) who actually displayed the disfavoured property (recidivated, defaulted, etc) should be equal across groups.

My example doesn't tell against these decision-theoretic statistical criteria. But here's one reason I'm inclined to be skeptical of them: They ignore the effects of the algorithm's use on those members of relevant groups whose behaviour is never predicted by the algorithm. For instance, they ignore the algorithm's effects on members of different racial groups who never have contact with the criminal justice system and therefore never receive a prediction from the algorithm as to whether they'll recidivate, or who never apply for a loan and therefore never receive a prediction as to whether they'd default.

I'm inclined to think that whether the use of some algorithm to make decisions is fair to groups or not depends not only on how it treats those who come into contact with the algorithm (so to speak), but also how its use affects members of the relevant groups more broadly.

That's my inclination, but there's certainly more to be said on the matter.

Rafal Urbaniak

12/13/2020 06:29:20 am

Hi Brian,

Great post, thanks!

Maybe I'm making a silly mistake, and even if not, this doesn't undermine your main point, but I can't follow your reasoning about FPR and FNR in room A.

In A, 12 people are labeled .75, all are predicted positive, and 9 of them are in fact positive, so 3 of them aren't, and so it seems FPR is 3/12=1/4. But you say "For Room A, the False Positive Rate is 3/10."

Similarly, in A, 8 people are predicted negative, and 7 of them are negative, so the FNR seems to be 1/8, and yet you say "For Room A, the False Negative Rate is 1/10".

The FPR and FNR in room B seem to be correct, and so even if I'm right, the rates differ between the rooms.

Also, I think a mention of room B is missing from this sentence:

"And for Room A, the False Negative Rate is 1/10, while the False Negative Rate is 4/10."

A different question: just because calibration is satisfied in the example, it doesn't follow that it's a necessary or sufficient condition for fairness; how would you defend it? (and as what sort of condition)?

Brian Hedden

12/13/2020 07:15:24 pm

Hi Rafal,

Thanks for the comments. You're right about the missing mention of Room B in that sentence. It should be that the Room A FNR is 1/10, while the Room B FNR is 4/10.

For your first question, I think you're actually thinking of what is often called the 'Positive Predictive Value' rather than the False Positive Rate. The terminology is not always terribly perspicuous but let me try to lay down some of the key terms and definitions:

False Positive Rate is the percentage of actually negative people who are falsely predicted to be positive.

False Negative Rate is the percentage of actually positive people who are falsely predicted to be negative.

Positive Predictive Value is the percentage of people predicted to be positive who are actually positive.

Negative Predictive Value is the percentage of people predicted to be negative who are actually negative.

So when you crunched the numbers and got 1/4, that's actually 1 minus the positive predictive value (the percentage of positive predictions which were incorrect), and when you got 1/8, that's actually 1 minus the negative predictive value.
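
The difference comes down to which denominator is used. A short Python sketch with the Room A counts from the post (the variable names are illustrative):

```python
from fractions import Fraction

# Room A counts from the post's example.
tp, fp = 9, 3   # the 12 people predicted heads: 9 heads, 3 tails
fn, tn = 1, 7   # the 8 people predicted tails: 1 heads, 7 tails

fpr = Fraction(fp, fp + tn)             # denominator: actual negatives
one_minus_ppv = Fraction(fp, fp + tp)   # denominator: predicted positives
fnr = Fraction(fn, fn + tp)             # denominator: actual positives
one_minus_npv = Fraction(fn, fn + tn)   # denominator: predicted negatives

print(fpr, one_minus_ppv)  # 3/10 1/4
print(fnr, one_minus_npv)  # 1/10 1/8
```

The numerators match in each pair; only the reference class changes.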

I'll email you the paper I have on this, where I give the definitions of all the main relevant statistical properties considered in the literature.

As for your question about how to justify Calibration, that's a hard question. All my case shows is that it COULD be necessary for fairness, since it's not violated in that case. I also do not think it's sufficient for fairness.

And actually, I think it might be worth breaking Calibration Within Groups into two sub-conditions. The first says that, for each risk score, the (expected) percentage of people assigned that risk score who are actually positive should be the same for each relevant group. The second says that, moreover, that (expected) percentage should be equal to that risk score. I think that the second is more properly regarded as a criterion of accuracy rather than fairness. But the first is plausibly necessary for fairness. It's hard to see how it could be violated without treating individuals differently depending on which group they came from.

Having said that, there may also be a tension between Calibration and another (non-statistical) criterion which requires that the algorithm be blind to group membership. In some cases, given the limited information available to us, it may be possible to satisfy Calibration only if the algorithm explicitly takes into account group membership when making predictions. For instance, suppose that the coins aren't labeled with their bias, but all we know is that Room A people all have coins with bias 1/4 while Room B people all have coins with bias 3/4. Then, the only way to satisfy Calibration would be to predict based on room membership, assigning any Room A person a risk score of 1/4 and any Room B person a risk score of 3/4. If that's right, and Calibration and the blindedness conditions aren't actually jointly satisfiable, we face a choice about which one should go. I'm inclined to keep Calibration and reject the condition saying that the algorithm must be blind to group membership. But I can see a case for going the other way.
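
The tension in the unlabelled-coin case can be sketched in a few lines of Python. Everything here is an illustrative assumption: the function names are hypothetical, and the blinded rule's score of 1/2 is an arbitrary choice (no single score would do better).

```python
from fractions import Fraction

# Assumed setup from the comment: unlabelled coins, known room-level biases.
room_bias = {"A": Fraction(1, 4), "B": Fraction(3, 4)}

def blind_score(room, s=Fraction(1, 2)):
    """A rule blinded to room membership: `room` is ignored, so everyone
    gets the same score s (1/2 is an arbitrary illustrative choice)."""
    return s

def room_aware_score(room):
    """A rule that predicts from room membership."""
    return room_bias[room]

def expectationally_calibrated(score_fn):
    # Everyone in a room gets one score, and the expected share of heads
    # people in that room is the room's bias, so calibration within
    # groups requires the score to equal the bias in every room.
    return all(score_fn(room) == bias for room, bias in room_bias.items())

print(expectationally_calibrated(blind_score))       # False
print(expectationally_calibrated(room_aware_score))  # True
```

Since a blinded rule must give both rooms the same score, and the rooms' biases differ, no choice of that score can match both.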

Thanks!

Stranger Apologies
