Expertise as Proxy: Stabilizing Uncertainty in Reinforcement Learning from Human Feedback (RLHF)

In their earliest iterations, Large Language Models (LLMs) didn’t have the exaggerated mannerisms of a helpful assistant that have since become their hallmark. They were limited to finishing sentences or curtly answering questions, which greatly limited their usefulness. This changed with the release of “InstructGPT” in January 2022, the first major model to use Reinforcement Learning from Human Feedback (RLHF) as the final step of its post-training. This process was later cited as a key step in pushing LLMs towards the question-answer paradigm that is now common. Since then, RLHF has remained invaluable in making models more accurate (especially for scientific and coding tasks), reducing hallucinations, aligning the model with human values, and giving the model a “personality” (Ouyang et al. 2) (Bai et al. 15). As the name suggests, this process is centered on collecting human preferences to train the model.

This paper will focus on the evolving role of human labor in this process, with an emphasis on the emerging role of “experts” in the pipeline (Lu). It will show how “expert” training of models during RLHF is a murky mixture of legitimate model improvement, stretched epistemic claims, and hidden human labor. It investigates how, while high-quality, well-labeled data from domain experts can undoubtedly improve model outcomes, “expertise” is often mobilized to establish epistemic authority across vastly different domains, to paper over fundamental model weaknesses in certain fields, and to act as a proxy for general human experience. It also examines how these experts are mobilized narratively, both by frontier labs in their pursuit of AGI and “PhD-level AI” and by data labeling companies attempting to sell their services to those labs. Lastly, it will discuss how the category of “expert” stabilizes the often disorganized and incoherent labor experience of the contractors working for data labeling companies, comparing their working conditions to more traditionally understood “unskilled data work.”

System Introduction

Reinforcement Learning from Human Feedback was developed atop traditional Reinforcement Learning (RL), a method popular in Machine Learning circles since the 1990s. In RL, an “agent” takes actions in a target environment according to a “policy”; each action is evaluated by a reward function, and the policy is updated depending on the reward it receives. To apply this to language models, the InstructGPT team created a dataset of human preferences, collecting several model responses to a given prompt and ranking them in order of desirability according to defined criteria (Ouyang et al. 3). This dataset was used to train a reward model that was then ostensibly able to predict what kind of response a human would prefer. Technically, this is a model that takes in an input-output pair and assigns it a scalar value (Ouyang et al. 8). The reward model was then used as part of the loop described above, with the underlying language model updated depending on the reward model’s output. A key concern with RLHF is “reward hacking”: the model producing gibberish that exploits statistical openings in the reward model, which is why the policy is typically also penalized for drifting too far from the pre-trained model. Since the launch of InstructGPT, RLHF has become a staple method for LLM training, including in the latest frontier models like GPT-5. As it stands, RLHF provides a few main benefits: it “aligns” the model with human preferences, adjusting qualities like affect and tone; it makes the model better at following instructions; and, most importantly, it reduces hallucinations and increases factuality and accuracy (Ouyang et al. 11-14). Notably, it can be combined with other techniques to surface pre-trained knowledge in an organized and structured manner, especially for coding and math-related problems.
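
To make these mechanics concrete, the sketch below (in Python with PyTorch, using a hypothetical reward_model callable in place of the actual network) illustrates the pairwise objective typically used to train a reward model on ranked responses, along with the KL-style penalty commonly used to discourage reward hacking. It is an illustrative approximation of the approach Ouyang et al. describe, not their exact implementation; the function names and the beta coefficient are assumptions.

```python
import torch.nn.functional as F

def preference_loss(reward_model, prompt, chosen, rejected):
    """Pairwise loss for training a reward model on human rankings:
    push the score of the human-preferred response above the other's.
    `reward_model` is a hypothetical callable returning a scalar tensor."""
    r_chosen = reward_model(prompt, chosen)      # scalar score
    r_rejected = reward_model(prompt, rejected)  # scalar score
    # Maximize the log-probability that the preferred response wins.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def shaped_reward(reward, logp_policy, logp_reference, beta=0.02):
    """One common guard against reward hacking: penalize the policy for
    drifting far from the pre-trained reference model, so it cannot
    produce gibberish just to exploit the reward model. `beta` is an
    assumed illustrative coefficient."""
    return reward - beta * (logp_policy - logp_reference)
```

In practice, the reward model is typically the language model itself with a scalar output head, and the shaped reward is what the policy-update step then optimizes.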

Human labor is clearly an integral part of this process. A common refrain across the interviews I conducted and Reddit threads in these communities is that circa 2023, most RLHF tasks involved correcting rudimentary errors of sentence construction, grammar and form. These tasks typically did not require credentials beyond a high school diploma and a basic aptitude test. It must be noted that RLHF had great early success refining the outputs of these LLMs. As their capabilities grew, these basic errors became less frequent, and the role of RLHF data labellers shifted to checking model responses for more subtle qualities like nuance, context, tone and logical consistency. The role of domain experts in areas like Math, Computer Science and Linguistics also became more pronounced as LLMs grew more capable in those fields and required people with expertise to verify their outputs (Lu). A Computer Science master’s student I spoke with pegged the coding ability of LLMs in 2024 as “somewhere between an undergrad and masters student.” In the past couple of years, a series of startups such as Outlier, Surge AI and Toloka have emerged, advertising themselves as platforms on which machine learning practitioners can find “experts” to label their data, with RLHF data being the most prominent application. These websites promise expertise in domains across the physical sciences, social sciences, humanities and art, with the Surge AI website declaring its goal to be a model that emulates figures like Hemingway and Von Neumann (Surge AI Website). Through interviews with multiple “expert” data labellers and analysis of various threads in company-specific subreddits, one can see how the category of “expert” is applied to confer legitimacy across several domains. In some cases, this expertise can contribute meaningfully to model performance; in others, the expertise is not the kind that can be conveniently transferred, or the concept of an “expert” itself is out of place. This analysis can then be contrasted with claims made by frontier labs and data labeling companies.

Methodology

This paper applies Suchman’s call to focus on the “locations, politics, material-semiotic specificity and effects, including consequences of the ongoing enactment of AI as a singular and controversial object” (Suchman 4) by focusing on a specific stage of the LLM training pipeline, rejecting the idea of an LLM as one large black box. By analyzing the discursive role played by the category of “expert”, it also builds on Jaton and Sormani’s call to focus on “the performances that make ‘AI’ appear or disappear.” To build this specific and situated understanding of RLHF, this paper draws on several sources. The primary source is a series of five semi-structured interviews with expert data labellers on the Outlier and Surge AI platforms. The interviewees are a mix of graduates from bachelors, masters and PhD programs, spanning the fields of linguistics, politics, philosophy and computer science. The interviews cover their experiences as data laborers, focusing on the details of the tasks they were asked to complete. I inquired about moments where they had to apply their subjective judgements, as well as moments where they were asked to complete tasks removed from their area of expertise. The interviewees described tasks ranging from correcting code and mathematical logic, to analyzing model writing for readability and logical flow, to evaluating the emotional register of the model’s responses. These interviews are supplemented with discourse analysis of threads from two subreddits, r/outlier_ai and r/dataannotationtech, which correspond to the two expert data labeling companies this paper focuses on: Outlier AI and Surge AI. These Reddit threads gave me access to the experiences of a wider community of data labellers, who confirmed many of the observations from the interviews and opened up new ground for exploration. I further analyzed blog posts, press releases and promotional material, both client-facing and contractor-facing, put out by the data labeling platforms Outlier AI and Surge AI. These materials gave me an idea of the claims these platforms make about the ability of their “expert” data to boost model performance, as well as the promises they make to their large contract workforce. Job postings hiring for experts in different fields also indicated what kinds of skills these companies are looking for. Lastly, I focused on frontier labs, mainly OpenAI and Anthropic, analyzing their promotional material around RLHF to see the degree to which they link model performance to RLHF. I also analyzed technical papers coming out of frontier labs to better understand the role of RLHF in modern LLM training. This is an early investigation into this phenomenon; these sources are not expected to provide a comprehensive overview of expert involvement in RLHF. Rather, they suggest patterns to be investigated further.

The function of “expertise” in RLHF

In his book “Trust in Numbers”, Porter describes how “objectivity” mobilizes feelings of reassurance from the public. He points out that engineers, doctors and lawyers are considered “objective” while politicians and salesmen are not (Porter 3). This is worth keeping in mind as we discuss how data labeling companies deploy the notion of “expertise” across different domains. First, it is important to establish how expertise is defined by these platforms. According to the interviewees, expertise on platforms like Outlier AI and Surge AI is defined almost entirely by credentials, with PhD students getting the most complicated tasks and the highest pay, while undergraduates get access to much simpler tasks for pay closer to minimum wage ($15/hr). This distinction is confirmed by an analysis of job postings from these companies, which emphasize only the required credentials and the field. It is common to see postings like “Advanced x expert sought for model training” (Outlier Job Board), with the requirements being an advanced degree and proficiency in English, without any other details. Internally, per my interviews, experts are also divided into two categories: “specific” experts, with domain knowledge in a particular field, and “generalist” experts, who are usually assigned tasks analyzing model writing. The idea of the “general” expert is in some sense paradoxical, as “expertise” is colloquially associated with specificity as opposed to generality. Here, the term “expert” is used in line with Porter’s description to mobilize feelings of objectivity in tasks like creative writing or emotional modulation that might not fit neatly within the framework of “expertise”. Further, as Porter highlights, expert knowledge cannot be “reduced to a handful of rules”; there is an intuition that comes with experience that continues to be valuable (Porter 7). Keeping this idea of expertise in mind, one can analyze some of the tasks performed by “expert” data labellers.

The domains that appeared most frequently during my search for interviewees, as well as in my analysis of Reddit threads and labeling job postings, were those centered on STEM and the natural sciences. This is not entirely surprising, as one of the often-touted benefits of RLHF is increased accuracy and a reduction in hallucinations. There is typically wide agreement on the “correct” answer to problems well into the graduate level in STEM fields, making this ideal ground for RLHF. If enough data annotators converge on the correct solution to a set of problems, that behavior can be strongly reinforced in the LLM via the reward model. As the models get more capable, domain experts are required to verify the accuracy of the models’ outputs, something that might not be possible at first glance for a layperson. An interviewee doing expert labeling for Math and Computer Science described evaluating code produced by the model, correcting bugs, syntax and logical flow. For Math problems, they were asked to focus on logical flow and context retention, making sure the model maintained accuracy through the entire exchange. They also noted that the guidelines emphasized accuracy in the intermediate steps, not just the final solution of a problem. These are also the domains in which RLHF has had the most success; models have become notably better at coding tasks, for example (Barr). However, at the PhD level, this unified consensus on scientific problems begins to crack. Across mathematics and the physical sciences, there are disagreements at the frontier, and there are fewer people who can review model outputs and give them direction. I reached out to several PhD-level scientists who worked as data labellers to understand the application of RLHF to frontier scientific concepts but was not able to schedule an interview in time; the rest of this discussion is therefore more speculative than empirical. That said, it is easy to see this mix of RLHF and limited “expertise” leading to phenomena like “vibe physics”, where models confidently regurgitate vaguely scientific text that touches on correct terminology in a disordered and ultimately nonsensical way (Siegel). Of course, RLHF is not the only tool frontier labs have to improve model performance in scientific fields, but this does call into question the claims of expert data labeling platforms about the PhD-level potential of models reinforced by expert RLHF. It is fitting to end this section by recalling Narayanan and Kapoor’s warning that we should treat any scientific output from Large Language Models with “extreme caution” (Narayanan and Kapoor 245), especially as we begin to use them to enable new discoveries, as opposed to solving established problems.

Discussing expertise, Porter points out that “where a consensus of experts is hard to reach, or where it does not satisfy outsiders, mechanical objectivity comes into its own. [...] It means following the rules. Rules are a check on subjectivity: they should make it impossible for personal biases or preferences to affect the outcome of an investigation” (Porter 4). This attitude is clearest in expert data labeling companies’ treatment of the humanities. Across their websites, these companies insist that RLHF expertise is not limited to STEM, invoking the involvement of humanities “experts” in the RLHF process. I spoke to a Philosophy PhD student who held expert status on Outlier AI in the fields of Philosophy, Psychology and Education Theory. They described the process of imparting humanities expertise to these models as “reductive”, claiming that the labeling companies expected straightforward answers to complex philosophical questions. They admitted that completing tasks in the humanities involved an element of “playing the game” that didn’t feel intellectually honest. Reddit threads by humanities “experts” support this contention. One user says the labeling companies are only interested in checking whether models can recall a specific theorist or attribute a certain theory to the right thinker, without getting into the specifics of what these theories entail and how they interact with one another (Reddit thread #2). RLHF is clearly not a natural fit for increasing “accuracy” when what counts as “accurate” is subjectively defined yet not entirely arbitrary, a fine line to walk. The typical model of reinforcement is insufficient here; if experts give their feedback in an unconstrained manner, it will likely not converge enough to reliably change model behavior. As Porter describes, when a consensus is hard to reach, companies introduce mechanical objectivity, reducing complex topics to a list of rules, instead of admitting the epistemic limits of this kind of approach. This phenomenon is also reminiscent of Kang’s assertion that the establishment of a ground truth always necessitates a “flattening” of a messy problem into something that can be reliably datafied or quantified (Kang 3). A clear solution would be to admit to this flattened version of knowledge transfer and acknowledge the limits of an LLM fine-tuned through RLHF in the realm of the humanities. However, AI companies, both frontier labs and data labeling companies, are prone to exaggerating model capabilities as part of a larger project of asserting the imminence of AGI. This makes it harder both to acknowledge where certain techniques might be useful and to point out their less-than-ideal uses.

As described earlier, expert data labeling platforms maintain two separate internal designations for their workers: “specific” experts, who are verified as domain experts in a particular field, and “generalist” experts, who are given a wide range of tasks that don’t necessarily relate to a specific area of expertise. These tasks can range from analyzing model writing for tone, affect, emotional register and logical flow, to fact-checking complex claims. Mulvin describes proxies as “necessary forms of make-believe and surrogacy that enable the production of knowledge” (Mulvin 4). He points out that proxies enable the “theatrical enactment of objectivity” (Mulvin 5), but a great deal of work goes into stabilizing them. Once stabilized, these proxies are taken to accurately represent everything they are supposed to stand in for. Mulvin’s critique of the idea of “reasonable people” as proxies for a fictitious “objective” person in the legal field (Mulvin 15) is also applicable here. With this definition in mind, we can examine some of the testimonies from my interviews with “general experts” about the tasks they were asked to perform and understand how academic credentials can act as a proxy for a diverse array of human experiences.

A striking example of this came up in my conversation with a master’s graduate in Linguistics. They described one of their tasks as regulating the emotional state with which the model responded to user prompts. The task involved them assuming various emotional states as the user: they would create prompts in happy, sad, angry or jealous registers, for example, and evaluate the degree to which the model acknowledged the emotional register of the prompt. They said a good response would identify the user’s emotional state and attempt to mimic it or assuage it in some way. When I pressed for more details, they cited confidentiality agreements, indicating that this work was of a particularly sensitive nature. When I suggested that the kind of response someone would expect in a certain emotional state was not obvious, and varied based on a number of factors, they shrugged and said that they did their best to imagine what a valid response would be. Setting aside the wisdom of having a model respond to often complicated and loaded emotions in this manner, even taken at face value this premise is epistemically unsound. The process is reminiscent of the CEA described by Kang, where actors try to simulate certain emotions to construct a ground-truth dataset. As Kang points out, this kind of acting is not generalizable beyond a particular group, in this case, the data labellers (Kang 7). It does not ensure that the user’s message gets an adequate emotional response. Here, experts as designated by the data labeling firm act as “proxies” for the subjective human experience of interpreting and responding to emotions. Their “expertise” doesn’t account for the variety of social, cultural and interpersonal factors that go into this process, but the label of “expert” adds a veneer of legitimacy. This is particularly important to note as human-model alignment and added empathy are cited as core benefits of RLHF. Further, recent studies show a rise in the use of LLMs for companionship and therapy (Rousmaniere et al.). By making “experts” proxies for subjective emotional interpretation, we risk legitimizing a fundamentally unsound process. We also risk creating misaligned expectations among users as to what emotional responses should look like in the first place.

I came across another example of expertise acting as a proxy for human experience while browsing job postings on expert data labeling sites. I found an application for a “general” expert in which one of the advertised responsibilities was “writing stories based on certain prompts.” Upon further investigation, I came across a Reddit post written by a Masters student in a Creative Writing program describing the kinds of creative tasks they performed (Reddit thread #4). The tasks included writing a series of short stories in response to prompts provided by various AI labs. I found the idea of an “expert” endowing an LLM with creativity fascinating. It is widely acknowledged that creativity can come from anywhere, not necessarily from someone with a Masters degree. In fact, many innovative authors and poets come from outside elite institutions and are lauded for their fresh perspectives. However, if data labeling companies truly tried to gather creative input from a wider sample of the population, it would likely not contain statistical patterns significant enough for the model to learn anything of note. It would also be incredibly time-consuming and expensive. Instead, labeling companies lean on the degree as a proxy. It is also possible that writing sanctioned by elite programs follows a more formulaic approach associated with certain norms and conventions in the field. Again, instead of acknowledging the fundamental limits of RLHF (or LLMs, for that matter) in enabling “creativity”, the “expert” is used to make an unsound claim. Seen through Mulvin’s lens of the “proxy” stabilizing unclear or sometimes unsound circumstances of origin, it becomes clear that, more than anything, the “expert” is a term meant to sweep questions of subjectivity under the rug.

The last example I want to bring in from my interviews relates to the construction of “expertise” as a category. Bowker and Star point out that categorization is always an attempt to wrestle a diverse set of human experiences into one box or another. They use the example of standardized tests, which categorize students as capable or leave them by the wayside depending on their scores (Bowker and Star 6). Here, people are accepted as “experts” based on the credentials they hold. The fragility of this category was revealed to me during an interview with another “general” expert, who described the process of fact-checking claims made by the model. They narrated a case where the model made certain claims about a cult rockstar from the 20th century. They looked for information to verify these claims in all the trusted sources (newspapers, magazine articles, documentaries) but were unable to find anything. They ended up having to use information from an “unreliable” blog that they didn’t trust. This story shows the weakness of the process on two fronts. Firstly, despite resting on unreliable information, this person’s input was accepted because of their categorization as an “expert”. Their expertise doesn’t make the information any more reliable, but because they have already been vetted as an expert, the uncertainty is explained away and stabilized. Secondly, it shows the limitations of fact-checking itself, as not every anecdote, story or event in human culture will have reliable sources detailing it. It will always fall upon humans to make the judgement of what to trust.

It is clear that RLHF is a practically useful but incomplete way of transferring “expertise” to a Large Language Model. In their quest to reduce hallucinations, model developers can end up flattening and reducing the surface area of what they consider to be knowledge. And while RLHF may well make the model more “helpful”, the question of who is trusted to “align” the model and impart their preferences to it, and who is excluded from this process, looms large as these models diffuse through the workplace and other parts of social life.

Narratives around expertise in RLHF

Narrative plays an outsized role in the modern AI industry. As Narayanan and Kapoor note, the hype cycle around new AI technologies follows a familiar pattern, with each new development bringing a round of hype, only for many of those technologies to eventually fall by the wayside (Narayanan and Kapoor 229). RLHF is no different; this section describes the role of various actors (frontier labs, data labeling companies and the data labellers themselves) in creating a certain kind of narrative around RLHF. In papers and statements released by frontier labs (OpenAI, Anthropic and Meta), the role of experts in RLHF is conspicuous by its absence. The role of human labellers is acknowledged briefly, while the rest of the papers usually focus on technical interventions. The humans referred to are not distinguished in any way, creating an impression of the “faceless beings” Denton et al. describe, who have been “dis-individualized” (Denton et al. 10). The most direct reference to expert labellers I could find was a post on the Surge AI blog, written by a member of Anthropic, praising Surge AI for its high-quality datasets and domain experts (Anthropic Surge AI Blog Post). There could be many reasons for this; perhaps expert data labellers are simply not that valuable to frontier labs. But their business decisions suggest otherwise. Meta recently paid $14.8 billion to acquire a 49% stake in Scale AI, with Outlier AI (its expert-focused subsidiary) being one of the drivers of the deal. Surge AI, meanwhile, recently became a unicorn without VC funding, indicating high demand for its services (Yahoo Finance). These are just two among many companies raising large sums of money on the promise of a network of experts and high-quality data for RLHF. The disconnect between the scant mentions of experts in company documentation and the high demand for expert labellers can be explained by Tubaro et al.’s work on AI data labeling services. They presciently claim that AI companies often downplay the role of human labor at the “verification” stage, as any indication that AI output needs modulation or verification can erode trust in the model (Tubaro et al. 8). Frontier labs thus shape the narrative around expert-enabled RLHF largely through omission, creating a vacuum around the role these experts play in boosting model capability.

This narrative space is eagerly filled by expert data labeling companies. On its website, Surge AI invokes, in the space of a few sentences, Ernest Hemingway, Frida Kahlo, John von Neumann and the Riemann hypothesis. The website boasts of the “physicists, philosophers and poets” in Surge AI’s network, with the logos of various Ivy League universities flashing across the bottom. It promises that Surge AI’s RLHF data “embodies the richnesses and subtleties of the world” (Surge AI Website). Outlier AI’s website is decidedly less grandiose, but shares the theme of credential signaling, emphasizing the PhD and MA holders across several different fields. Surge AI’s repeated invocation of Ivy League universities and other prestigious organizations brings an interesting dynamic to the fore. Using a metaphor from performance studies, Newlands highlights that AI companies often sweep labor dynamics to the “backstage” in order to present a sanitized exterior. However, they may also strategically make labor “hyper-visible”, bringing it into the frontstage, if the work is considered “aesthetically pleasing” (Newlands 3). This is exactly what “expert” data labeling companies like Surge AI do: experts from prestigious colleges are plastered across the front page of the website, while the majority of the work happens through a global network of “experts”, some of whom rely on it as their primary source of income and contend with the uncertainty and precarity that comes with this kind of work. This narrative move gives RLHF a veneer of prestige by invoking elite universities while hiding the real workforce.

This narrative ecosystem, with frontier AI labs proclaiming the arrival of “PhD-level AI” while staying relatively quiet about the role of human labor, and expert data labeling companies filling that vacuum with claims that RLHF data can replicate the richness of human experience, fits almost too perfectly into the sociotechnical imaginary framework established by Bareis and Katzenbach. The myth of AGI grounds this imaginary, while the awe of the technological sublime (human experience at its highest level, imbued in the model through statistical techniques) gives it wings. This produces the expectation that “PhD-level AI” (whatever that means) and, by extension, AGI are around the corner (Bareis and Katzenbach 6). This narrative structure papers over the very real epistemic limitations of RLHF, even with expert labeling.

The role of data workers

Understanding the experiences of “expert” data labellers as contract workers subject to fluctuating pay, uncertain workloads and precarious status is important to seeing the full picture of “expertise” and RLHF. Traditionally, data labeling has been considered a low-paying job, outsourced to marginalized workers, often from the Global South (Muldoon et al. 4). By emphasizing “high quality” data produced by “domain experts”, data labeling companies are attempting to change this perception.

However, across interviews with “expert” data labellers, as well as analyses of Reddit threads, it is clear that while experts are paid markedly better than their “unskilled” counterparts, there are striking similarities between their working conditions. All the interviewees emphasized that projects were difficult to come by, and months often went by without any income from this work. Of the five interviewees, three did this work part-time for extra income. The one person I interviewed who did it full-time was all too aware of the precarity and said they would advise against relying on it as a primary source of income. This is also borne out by testimonies on Reddit, where people who had worked steadily for extended periods suddenly ran out of work.

Further, all the interviewees agreed that this was an isolating job, with one even saying that too much discussion with other workers on internal company message boards was heavily discouraged and could lead to a ban. All the interviewees described the mundanity of the job, calling it repetitive and unfulfilling. Lastly, everyone I spoke to emphasized the automated nature of the entire process, with minimal human interaction from start to finish. This made it hard, they said, to contest decisions made by the platforms that could seem arbitrary.

These similarities with “traditional” data work show that “expert” data work is not a distinct category somehow imbued with legitimacy and authority, but rather a continuation of a longer trend of AI companies relying on unacknowledged labor.

Another particularly damning point becomes obvious after spending some time on the subreddits dedicated to Outlier AI and DataAnnotation.Tech: the technical instability behind granting someone the status of “expert”. These subreddits contain many instances of people waking up to find their “expert” status, established through credentials or qualifying exams, removed, suddenly finding themselves back at the level of an “ordinary” data labeller. Conversely, there are cases of people with established expertise in one field being randomly offered complex tasks in another (Reddit threads #1, #3, #5). Workers report stress and anxiety caused by this constant shuffling, and it can hardly be a positive environment for expertise sharing. Calling back to Bowker and Star, these stories show just how much uncertainty is stabilized behind the category of the expert, and the gap between the claims made by these expert data labeling companies and the coherence of the “backend” of their projects.

Conclusion

While the research in this paper is drawn from a small sample of interviews, a few patterns can be gleaned. Firstly, it appears that while expert-led RLHF labeling can improve accuracy in certain domains, it also imposes a rigid emphasis on the one “correct answer”, leading to epistemic flattening and reduction in other domains. Further, its claim to “model alignment” and better model “personality” rests on fragile proxies. Secondly, the silence of frontier labs on the precise role of experts, the lack of an accurate appraisal of the capabilities of expert-driven RLHF, and the attendant hype cycles propagated by data labeling companies lead to a confusing situation in which it is increasingly difficult to tell precisely how this technique benefits LLMs and where it falls short. Lastly, although the category of the “expert” is constantly reified in promotional materials, the actual treatment of these experts is closer to the treatment of “traditional” data workers: they work in uncertain conditions, with a constantly changing supply of work and an entirely automated process, making it very hard to find recourse in an unexpected situation. Taken together, these findings should make it clear that even as LLMs appear increasingly capable of complex reasoning in multiple domains, and become more appealing as personal companions or conversation partners, there remain several epistemic gaps in their training process, and we must not accept any output from an LLM uncritically. Most importantly, this paper shows that the role of humans in training these models remains indispensable.

References:

Denton, Emily, et al. “On the Genealogy of Machine Learning Datasets: A Critical History of ImageNet.” Big Data & Society, vol. 8, no. 2, July 2021, p. 20539517211035955. DOI.org (Crossref), https://doi.org/10.1177/20539517211035955. 

Jaton, Florian, and Philippe Sormani. “Enabling ‘AI’? The Situated Production of Commensurabilities.” Social Studies of Science, vol. 53, no. 5, Oct. 2023, pp. 625–34. DOI.org (Crossref), https://doi.org/10.1177/03063127231194591. 

Kang, Edward B. “Ground Truth Tracings (GTT): On the Epistemic Limits of Machine Learning.” Big Data & Society, vol. 10, no. 1, Jan. 2023, p. 20539517221146122. DOI.org (Crossref), https://doi.org/10.1177/20539517221146122. 

Kang, Edward B. “On the Praxes and Politics of AI Speech Emotion Recognition.” 2023 ACM Conference on Fairness, Accountability, and Transparency, Chicago, IL, USA, 2023, pp. 455–66. DOI.org (Crossref), https://doi.org/10.1145/3593013.3594011.

Newlands, Gemma. “Lifting the Curtain: Strategic Visibility of Human Labour in AI-as-a-Service.” Big Data & Society, vol. 8, no. 1, Jan. 2021, p. 20539517211016026. DOI.org (Crossref), https://doi.org/10.1177/20539517211016026. 

Porter, Theodore M. Trust in Numbers: The Pursuit of Objectivity in Science and Public Life. New edition, with A new preface by the author, Princeton University Press, 2020. K10plus ISBN, https://doi.org/10.1515/9780691210544. 

Mulvin, Dylan. “Samples of the World Out There: The Surrogate Logic of Proxies.” Proxies, The MIT Press, 2021, pp. 1–34. DOI.org (Crossref), https://doi.org/10.7551/mitpress/11765.003.0003.

Suchman, Lucy. “The Uncontroversial ‘Thingness’ of AI.” Big Data & Society, vol. 10, no. 2, July 2023, p. 20539517231206794. DOI.org (Crossref), https://doi.org/10.1177/20539517231206794.

Tubaro, Paola, et al. “The Trainer, the Verifier, the Imitator: Three Ways in Which Human Platform Workers Support Artificial Intelligence.” Big Data & Society, vol. 7, no. 1, Jan. 2020, p. 205395172091977. DOI.org (Crossref), https://doi.org/10.1177/2053951720919776. 

Bowker, Geoffrey C., and Susan Leigh Star. “Introduction: To Classify Is Human.” Sorting Things Out: Classification and Its Consequences, The MIT Press, 1999, pp. 1–32, https://doi.org/10.7551/mitpress/6352.003.0002.

Ouyang, Long, et al. “Training Language Models to Follow Instructions with Human Feedback.” arXiv, 4 Mar. 2022, arXiv:2203.02155.

Bai, Yuntao, et al. “Constitutional AI: Harmlessness from AI Feedback.” arXiv, 15 Dec. 2022, arXiv:2212.08073.

Bareis, Jascha, and Christian Katzenbach. “Talking AI into Being: The Narratives and Imaginaries of National AI Strategies and Their Performative Politics.” Science, Technology, & Human Values, vol. 47, no. 5, Sept. 2022, pp. 855–81. DOI.org (Crossref), https://doi.org/10.1177/01622439211030007.

Rousmaniere, T., et al. “Large Language Models as Mental Health Resources: Patterns of Use in the United States.” Practice Innovations, 7 July 2025, https://doi.org/10.1037/pri0000292.

Narayanan, Arvind, and Sayash Kapoor. “Why Do Myths about AI Persist?” AI Snake Oil: What Artificial Intelligence Can Do, What It Can’t, and How to Tell the Difference, Princeton University Press, 2024, pp. 227–57.

Siegel, Ethan. “Why ‘Vibe Physics’ Is the Ultimate Example of AI Slop.” Big Think, 30 July 2025, bigthink.com/starts-with-a-bang/vibe-physics-ai-slop/. 

Lu, Yiwen. “How A.I. Chatbots Are Trained: Inside the Race for Subject-Matter Experts.” The New York Times, 10 Apr. 2024, www.nytimes.com/2024/04/10/technology/ai-chatbot-training-chatgpt.html.

Muldoon, James, et al. “A Typology of Artificial Intelligence Data Work.” Big Data & Society, vol. 11, no. 1, Mar. 2024, p. 20539517241232632. DOI.org (Crossref), https://doi.org/10.1177/20539517241232632.

Barr, Alistair. “‘The Trillion-Dollar Question’: How Did Anthropic Make AI So Good at Coding?” Business Insider, www.businessinsider.com/anthropic-ai-breakthrough-vibe-coding-revolution-2025-7. Accessed 11 Dec. 2025.

“Surge AI Quietly Hit $1B without Outside Money - Now Even VCs Want In.” Yahoo Finance, finance.yahoo.com/news/surge-ai-quietly-hit-1b-150057861.html?guccounter=1. Accessed 11 Dec. 2025.

Websites:

Outlier Website -> https://outlier.ai/

Surge AI Website -> https://surgehq.ai/

Outlier Job Board -> https://app.outlier.ai/en/expert/opportunities?location=All&type=All

DataAnnotation.Tech Website -> https://www.dataannotation.tech/

Anthropic Surge AI Blog Post -> https://surgehq.ai/blog/anthropic-surge-ai-rlhf-platform-train-llm-assistant-human-feedback

Reddit Threads:

  1. https://www.reddit.com/r/outlier_ai/comments/1emlqiu/from_bio_expert_to_t1_generalist_fml/
  2. https://www.reddit.com/r/outlier_ai/comments/1k4lxdx/i_should_be_saying_this_but_i_dont_understand_why/
  3. https://www.reddit.com/r/outlier_ai/comments/1ltksk6/did_anyone_just_lose_oracle_status_randomly/
  4. https://www.reddit.com/r/outlier_ai/comments/1de89ew/whats_a_creative_writing_specialist_like/
  5. https://www.reddit.com/r/outlier_ai/comments/1dlgwmd/so_they_are_just_inviting_random_people_as_experts/