People must give more thought to the objectives of hiring algorithms, experts say, or else human biases may be perpetuated in software.

Fifty-five years ago, U.S. President Lyndon Johnson signed into law the Civil Rights Act, which he said was a continuation of “the unending search for justice within our own borders.” Today, as the promise of the Great Society continues to elude us, can software algorithms, which are taking on ever more roles in our world, be trusted to deliver it, at last?

Not without the help of humans, who must give far more thought to what objectives these algorithms are supposed to achieve, experts say. According to a recent study by researchers at Cornell University and Microsoft, hiring software used in workplaces where diversity is a stated goal may already be pursuing objectives that are vastly oversimplified, thereby perpetuating bias.

The researchers reached this conclusion by examining publicly available material from 19 venture-backed vendors of software used to automate hiring, companies such as HireVue of South Jordan, Utah, and Pymetrics of New York, whose products can automate steps such as giving job applicants online questionnaires or analyzing recorded video interviews.

“Vendors will say empirically their customers’ diversity numbers are going up,” says lead investigator Manish Raghavan, a computer scientist at Cornell University. “In practice it might be true, but simply recommending more people is not a guarantee you are recommending the right people.”

What’s the real goal?

It should be easier to remove bias from an algorithm than from a human, as University of Chicago professor Sendhil Mullainathan has written. But in practice, it’s not that simple. When assessing candidates for a job, people need to think through a range of questions about what they actually expect. In hiring from different groups, “you have to ask what are the sources of disparity,” he says. “It’s an investigation.”

For one thing, the goals encoded in the software are simplistic, perhaps too simplistic.

Take, for instance, the “four-fifths rule.” Part of the Equal Employment Opportunity Commission’s 1978 Uniform Guidelines on Employee Selection Procedures, the four-fifths rule says the rate at which people of any group are selected by an employer should not be less than four-fifths, or 80%, of the selection rate of the group with the highest rate of selection. The groups in this case are so-called protected classes, defined by characteristics such as race, sex, age, religion, disability status, and veteran status.

So if an employer hires 10% of the male candidates it interviews, for example, but fewer than 8% of the female candidates, the selection rate for women falls below four-fifths of the rate for men, and there may be a case for bias.
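As a rough sketch of how that check works (the group labels and counts below are made up for illustration, not taken from the study or from any vendor’s software), the four-fifths comparison can be computed directly from selection counts:

```python
# Illustrative four-fifths (80%) rule check with hypothetical numbers.

def selection_rate(hired, applied):
    """Fraction of applicants from a group who were selected."""
    return hired / applied

def four_fifths_ratios(groups):
    """Compare each group's selection rate to the highest group's rate.

    `groups` maps a group label to a (hired, applied) tuple. Values
    below 0.8 in the result suggest possible adverse impact under
    the four-fifths rule.
    """
    rates = {g: selection_rate(h, a) for g, (h, a) in groups.items()}
    highest = max(rates.values())
    return {g: rate / highest for g, rate in rates.items()}

applicants = {
    "men": (10, 100),   # 10% selection rate
    "women": (7, 100),  # 7% selection rate
}
for group, ratio in four_fifths_ratios(applicants).items():
    status = "below the 4/5 threshold" if ratio < 0.8 else "ok"
    print(f"{group}: impact ratio {ratio:.2f} ({status})")
```

In this toy example, women are hired at 70% of the men’s rate, which falls short of the 80% threshold.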

Software algorithms aiming simply to recommend more job candidates from a certain group can produce a list of candidates that looks sufficiently diverse. But how well are they assessing each group? A program can be better at selecting candidates from one group than from another, a phenomenon known as “differential validity.”

“You might have the best men and a merely random sample of the women,” says Raghavan. Even with enough female candidates to be compliant with the four-fifths rule, “that could confirm some of the negative stereotypes,” he adds.
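One way to probe for differential validity, sketched below with invented data rather than anything from the Cornell-Microsoft study, is to measure separately for each group how well the algorithm’s scores track a later outcome such as a performance rating; a large gap between groups suggests the tool assesses one group more accurately than the other.

```python
# Hypothetical differential-validity check: per group, how well do
# model scores correlate with later job performance? Data is invented.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

outcomes = {
    "group_a": {"scores": [0.9, 0.8, 0.6, 0.4, 0.2],
                "perf":   [4.8, 4.5, 3.9, 3.1, 2.4]},
    "group_b": {"scores": [0.9, 0.7, 0.6, 0.4, 0.3],
                "perf":   [3.0, 4.6, 2.8, 4.1, 3.5]},
}

for group, d in outcomes.items():
    r = pearson(d["scores"], d["perf"])
    print(f"{group}: score-performance correlation = {r:.2f}")
# A much weaker correlation for one group means the tool ranks that
# group closer to randomly, even if its overall numbers look diverse.
```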

Baseline biases

Humans need to think about the validity of performance reviews, too. “Performance reviews can have a lot of bias embedded in them,” says Raghavan. “The people whose evaluations you are using for ground truth [to train an algorithm] may not have been very good at doing evaluations in the first place,” he points out.

Software functions that measure people can be misleading. For example, video-interview software can measure how quickly a job candidate speaks. If the algorithm was not exposed to many non-native English speakers when it was first developed, it could be biased toward approving native speakers. But, “we don’t know that the cadence of your speech is predictive” of actual job performance, says Raghavan.
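A hedged sketch of that kind of feature audit (the feature name and numbers are hypothetical, and this is not HireVue’s or any other vendor’s actual pipeline): compare how a measured feature is distributed across groups, and check whether it tracks the outcome the employer actually cares about.

```python
# Hypothetical audit of a single feature (speaking rate): does it differ
# across groups, and does it predict performance? Numbers are invented.
from statistics import mean, correlation  # correlation needs Python 3.10+

candidates = [
    # (native_speaker, words_per_minute, later_performance_rating)
    (True, 165, 4.1), (True, 170, 3.2), (True, 158, 4.6),
    (False, 128, 4.4), (False, 135, 3.0), (False, 122, 4.2),
]

wpm_native = [w for native, w, _ in candidates if native]
wpm_other = [w for native, w, _ in candidates if not native]
print("mean words per minute, native vs. non-native:",
      round(mean(wpm_native)), "vs.", round(mean(wpm_other)))

wpm = [w for _, w, _ in candidates]
perf = [p for _, _, p in candidates]
print("speaking rate vs. performance correlation:",
      round(correlation(wpm, perf), 2))
# A feature that cleanly separates the groups but barely correlates
# with performance would filter out non-native speakers without
# predicting who will actually do the job well.
```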

The Cornell-Microsoft findings confirm some observers’ concern that not enough is done to examine human biases to begin with.

“If humans are biased, I don’t quite understand how we can expect the technology to have a much better outcome than what it currently is exposed to,” says Marlon Twyman II, an assistant professor at the USC Annenberg School for Communication and Journalism who was not involved in the research.

That means much more transparency is needed about the algorithms and the data. One question for vendors is what their plans are for “a thorough, rigorous, transparent validation” of their algorithms, says Raghavan.

Pymetrics is “in final discussion stages to secure an expert third party to conduct an external audit of our system, and plan to have results publicly available by summer 2020,” Frida Polli, Pymetrics’ co-founder and CEO, tells Fortune. The company is also preparing a peer-reviewed research paper on its methods.

The four-fifths rule should be considered a “minimum standard for avoiding bias,” says Polli. Although it is not the only measure the company’s software uses, even just meeting the four-fifths threshold is an improvement on typical human hiring, she says. “Most common hiring practices fail the 4/5ths rule,” she adds.

HireVue echoes Pymetrics’ point that meeting the four-fifths rule can be an improvement over the biases of the human hiring process. At the same time, the company tells Fortune it is “actively engaging on this topic, including with our advisory board, as we recognize the 4/5ths rule is the legal standard but there are broader questions about fairness.”

HireVue adds that it is in the process of selecting an independent third-party auditor to review its technology.

“For several years, our scientists have presented on our development, validation, and bias-mitigation processes and results at professional conferences,” the company says. “HireVue continues to be in close discussions with the scientific community regarding its technology.”

But disclosure is just the beginning of an investigation, Raghavan says. A human discussion needs to happen about what goes into measuring people, and it needs to happen not only inside the companies but also across the broader society, he adds.

“That’s the biggest thing the public needs to know,” he says. “How can they have confidence that these methods actually work?”
