Google’s Bard artificial intelligence chatbot will answer a question about how many pandas live in zoos quickly, and with a surfeit of confidence.
Ensuring that the response is well-sourced and based on evidence, however, falls to thousands of outside contractors from companies including Appen Ltd. and Accenture Plc, who can make as little as $14 an hour and labor with minimal training under frenzied deadlines, according to several contractors, who declined to be named for fear of losing their jobs.
The contractors are the invisible backend of the generative AI boom that’s hyped to change everything. Chatbots like Bard use computer intelligence to respond almost instantly to a range of queries spanning all of human knowledge and creativity. But to improve those responses so they can be reliably delivered again and again, tech companies rely on actual people who review the answers, provide feedback on mistakes and weed out any inklings of bias.
It’s an increasingly thankless job. Six current Google contract workers said that as the company entered an AI arms race with rival OpenAI over the past year, the size of their workload and the complexity of their tasks increased. Without specific expertise, they were trusted to assess answers in subjects ranging from medication doses to state laws. Documents shared with Bloomberg show convoluted instructions that workers must apply to tasks, with deadlines for auditing answers that can be as short as three minutes.
“As it stands right now, people are scared, stressed, underpaid, don’t know what’s going on,” said one of the contractors. “And that culture of fear is not conducive to getting the quality and the teamwork that you want out of all of us.”
Google has positioned its AI products as public resources in health, education and everyday life. But privately and publicly, the contractors have raised concerns about their working conditions, which they say hurt the quality of what users see. One Google contract staffer who works for Appen said in a letter to Congress in May that the speed at which they are required to review content could lead to Bard becoming a “faulty” and “dangerous” product.
Google has made AI a major priority across the company, rushing to infuse the new technology into its flagship products after the launch of OpenAI’s ChatGPT in November. In May, at the company’s annual I/O developers conference, Google opened up Bard to 180 countries and territories and unveiled experimental AI features in marquee products like search, email and Google Docs. Google positions itself as superior to the competition because of its access to “the breadth of the world’s knowledge.”
“We undertake extensive work to build our AI products responsibly, including rigorous testing, training, and feedback processes we’ve honed for years to emphasize factuality and reduce biases,” Google, owned by Alphabet Inc., said in a statement. The company said it isn’t only relying on the raters to improve the AI, and that there are a number of other methods for improving its accuracy and quality.
To prepare for the public using these products, workers said they started getting AI-related tasks as far back as January. One trainer, employed by Appen, was recently asked to compare two answers providing information about the latest news on Florida’s ban on gender-affirming care, rating the responses by helpfulness and relevance. Workers are also frequently asked to determine whether the AI model’s answers contain verifiable evidence. Raters are asked to decide whether a response is helpful based on six-point guidelines that include analyzing answers for things like specificity, freshness of information and coherence.
They are also asked to make sure the responses don’t “contain harmful, offensive, or overly sexual content,” and don’t “contain inaccurate, deceptive, or misleading information.” Surveying the AI’s responses for misleading content should be “based on your current knowledge or quick web search,” the guidelines say. “You do not need to perform a rigorous fact check” when assessing the answers for helpfulness.
The example answer to “Who is Michael Jackson?” included an inaccuracy about the singer starring in the movie “Moonwalker” — which the AI said was released in 1983. The movie actually came out in 1988. “While verifiably incorrect,” the guidelines state, “this fact is minor in the context of answering the question, ‘Who is Michael Jackson?’”
Even if the inaccuracy seems small, “it is still troubling that the chatbot is getting main facts wrong,” said Alex Hanna, the director of research at the Distributed AI Research Institute and a former Google AI ethicist. “It seems like that’s a recipe to exacerbate the way these tools will look like they’re giving details that are correct, but are not,” she said.
Raters say they are assessing high-stakes topics for Google’s AI products. One of the examples in the instructions, for instance, talks about evidence that a rater could use to determine the right dosages of Lisinopril, a medication used to treat high blood pressure.
Google said that some workers concerned about accuracy of content may not have been training specifically for accuracy, but for tone, presentation and other attributes it tests. “Ratings are deliberately performed on a sliding scale to get more precise feedback to improve these models,” the company said. “Such ratings don’t directly impact the output of our models and they are by no means the only way we promote accuracy.”
Ed Stackhouse, the Appen worker who sent the letter to Congress, said in an interview that contract staffers were being asked to do AI labeling work on Google’s products “because we’re indispensable to AI as far as this training.” But he and other workers said they appeared to be graded for their work in mysterious, automated ways. They have no way to communicate with Google directly, besides providing feedback in a “comments” entry on each individual task. And they have to move fast. “We’re getting flagged by a type of AI telling us not to take our time on the AI,” Stackhouse added.
Google disputed the workers’ description of being automatically flagged by AI for exceeding time targets. At the same time, the company said that Appen is responsible for all performance reviews for employees. Appen did not respond to requests for comment. A spokesperson for Accenture said the company does not comment on client work.
Other technology companies training AI products also hire human contractors to improve them. In January, Time reported that laborers in Kenya, paid $2 an hour, had worked to make ChatGPT less toxic. Other tech giants, including Meta Platforms Inc., Amazon.com Inc. and Apple Inc., make use of subcontracted staff to moderate social network content and product reviews, and to provide technical support and customer service.
“If you want to ask, what is the secret sauce of Bard and ChatGPT? It’s all of the internet. And it’s all of this labeled data that these labelers create,” said Laura Edelson, a computer scientist at New York University. “It’s worth remembering that these systems are not the work of magicians — they are the work of thousands of people and their low-paid labor.”
Google said in a statement that it “is simply not the employer of any of these workers. Our suppliers, as the employers, determine their working conditions, including pay and benefits, hours and tasks assigned, and employment changes – not Google.”
Staffers said they had encountered bestiality, war footage, child pornography and hate speech as part of their routine work assessing the quality of Google products and services. While some workers, like those reporting to Accenture, do have health care benefits, most only have minimal “counseling service” options that allow workers to phone a hotline for mental health advice, according to an internal website explaining some contractor benefits.
For Google’s Bard project, Accenture workers were asked to write creative responses for the AI chatbot, employees said. They answered prompts on the chatbot — one day they could be writing a poem about dragons in Shakespearean style, for instance, and another day they could be debugging computer programming code. Their job was to file as many creative responses to the prompts as possible each work day, according to people familiar with the matter, who declined to be named because they weren’t authorized to discuss internal processes.
For a short period, the workers were reassigned to review obscene, graphic and offensive prompts, they said. After one worker filed an HR complaint with Accenture, the project was abruptly terminated for the US team, though some of the writers’ counterparts in Manila continued to work on Bard.
The jobs have little security. Last month, half a dozen Google contract staffers working for Appen received a note from management saying their positions had been eliminated “due to business conditions.” The firings felt abrupt, the workers said, because they had just received several emails offering them bonuses to work longer hours training AI products. The six fired workers filed a complaint with the National Labor Relations Board in June. They alleged they were illegally terminated for organizing, because of Stackhouse’s letter to Congress. Before the end of the month, they were reinstated to their jobs.
Google said the dispute was a matter between the workers and Appen, and that they “respect the labor rights of Appen employees to join a union.” Appen didn’t respond to questions about its workers organizing.
Emily Bender, a professor of computational linguistics at the University of Washington, said the work of these contract staffers at Google and other technology platforms is “a labor exploitation story,” pointing to their precarious job security and how some of these kinds of workers are paid well below a living wage. “Playing with one of these systems, and saying you’re doing it just for fun — maybe it feels less fun, if you think about what it’s taken to create and the human impact of that,” Bender said.
The contract staffers said they have never received any direct communication from Google about their new AI-related work — it all gets filtered through their employer. They said they don’t know where the AI-generated responses they see are coming from, nor where their feedback goes. In the absence of this information, and with the ever-changing nature of their jobs, workers worry that they’re helping to create a bad product.
Some of the answers they encounter can be bizarre. In response to the prompt, “Suggest the best words I can make with the letters: k, e, g, a, o, g, w,” one answer generated by the AI listed 43 possible words, starting with suggestion No. 1: “wagon.” Suggestions 2 through 43, meanwhile, repeated the word “WOKE” over and over.
In another task, a rater was presented with a lengthy answer that began with, “As of my knowledge cutoff in September 2021.” That phrasing is associated with OpenAI’s large language model GPT-4. Though Google said that Bard “is not trained on any data from ShareGPT or ChatGPT,” raters have wondered why such phrasing appears in their tasks.
Bender said it makes little sense for large tech corporations to be encouraging people to ask an AI chatbot questions on such a broad range of topics, and to be presenting them as “everything machines.”
“Why should the same machine that is able to give you the weather forecast in Florida also be able to give you advice about medication doses?” she asked. “The people behind the machine who are tasked with making it be somewhat less terrible in some of those circumstances have an impossible job.” –BLOOMBERG