ChatGPT as a really big threat, because it always produces good text

Sascha Schönig: How do you use LLMs in your academic work? What are the advantages and disadvantages?

Jörg Pohle: My work relates to AI and LLMs in several ways. On the one hand, I teach about AI and LLM systems, about how they are built and used, about the academic discussion around those technologies and how they might change the academic enterprise in the future.

On the other hand, I use AI tools quite a lot in my own academic work, for example for translations. Of course, they cannot deliver a perfect translation, but I use them rather to get the general picture. Recently, during a workshop, I noticed that the information about a historical figure whose work we discussed was available only in Swedish. ChatGPT provided me with a rough translation, which gave me the opportunity to understand her work and her research interests.

Then I use ChatGPT in my search for academic literature. The chatbot is based on a language model that has been trained on a lot of text. Sometimes it helps me to understand the connection between different concepts. Of course, we should keep in mind that it is a probability model, so it shows how concepts most likely relate to each other. It provides you with a high probability with regard to the training data and basically shows you how likely it is that these concepts are related to other ones in the field. This way it is easier to spot new terms that are also relevant in my field of interest, but which I wouldn’t have thought of myself. That’s the added value compared to a Google search. Google is more focused on the exact keywords you use; it won’t necessarily point you to new, different terms that are also relevant and are being referred to in your field. So ChatGPT basically broadens my scope and gives me new terms that I can then use in a more “traditional search.”
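
To illustrate this kind of use, here is a minimal sketch in Python, assuming the OpenAI client library and an API key are available; the model name and the prompt are illustrative placeholders, not the exact setup described in the interview:

```python
# A minimal sketch of the "broadening the scope" use described above.
# Assumes the OpenAI Python client and an OPENAI_API_KEY in the environment;
# the model name and the prompt are illustrative, not the interviewee's.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "I am doing a literature search on 'mental models'. "
    "List related terms and concepts from neighbouring disciplines "
    "that I could use as additional search keywords."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)

# The output is treated purely as a list of candidate keywords for a
# conventional database search, not as verified facts.
print(response.choices[0].message.content)
```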

Sascha Schönig: How do you deal with the biases and errors ChatGPT produces? I remember that the 3.5 model produced a lot of biases and errors; for example, when you asked for the curriculum vitae of a scientist, even a well-known one, half of it was made up.

Jörg Pohle: I’m not really asking for something that would require an actual understanding to answer. Instead, I’m asking what other terms and concepts are used in connection with a concept ‘X’. The model is a probability distribution over the next word, so if there are terms that are frequently used together with the term I entered, they will show up in the result. And that is simply because of how the language models were trained: they are trained on words that are related to each other because they appear near each other in the same texts. By asking for something that is related, I’m pretty much just steering the model towards something it already has input on: something that is related to ‘X’. For example, I’m looking into how disciplines other than psychology refer to something like mental models. Mental models are a concept in psychology and refer to how individuals create an understanding for themselves. Other disciplines, especially sociology, refer to something roughly similar with different terms and different concepts, and they also refer to something that is not just individually but collectively produced. Mental models as understood in psychology are not collectively produced; they sit within individuals. So I’m looking for other concepts that describe something roughly similar to mental models, but from the perspective of a discipline like sociology. If I enter the proper terms and the proper phrase as a prompt, it doesn’t really matter whether the output is totally correct or not; it will produce words that are probable within that space of words in that field. And that means I get results that are good enough for further research into what a term might mean in this or that kind of research, how it is understood, or how it is differentiated from another term. Therefore, I don’t have this bias problem in that kind of research. I don’t have hallucination or bias problems because I’m not looking for the truth. I’m just looking for connections, for relationships between words that are used in the same field or in a similar one to refer to a similar or roughly similar phenomenon in the real world. I look for terms and maybe other concepts in that space, and then I can use those terms to do a search of the literature.
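
The principle described here, that words appearing near each other in the same texts end up related in the model, can be illustrated with a much simpler model than an LLM. The following is a minimal sketch using the gensim word2vec implementation; the toy corpus and the query term are purely illustrative:

```python
# Illustration of the co-occurrence principle described above, using a plain
# word2vec model from gensim rather than an LLM. The toy corpus is far too
# small to give meaningful neighbours; with a large corpus of texts from a
# field, the nearest neighbours of a term reflect how that field uses it.
from gensim.models import Word2Vec

corpus = [
    ["mental", "models", "describe", "individual", "understanding"],
    ["shared", "frames", "describe", "collective", "understanding"],
    ["sociology", "studies", "collective", "frames", "and", "representations"],
    ["psychology", "studies", "individual", "mental", "models"],
]

model = Word2Vec(corpus, vector_size=50, window=5, min_count=1, epochs=100)

# Terms that appear in similar contexts end up with similar vectors.
print(model.wv.most_similar("models", topn=3))
```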

Sascha Schönig: Our colleague Theresa Züger, whom I interviewed before, was talking about ChatGPT, and she said that for most use cases people overestimate what it can do and see it as a magical tool. I think what you said before is very important: understanding how the LLM was trained, what it is good for and how we can use it for our research. Often, people who speak about ChatGPT and large language models have little idea of what their purpose is or how they were trained. Then there is a lot of discussion about the biases and errors they produce, but that’s because they are not used in the way they were intended.

Jörg Pohle: The problem actually lies with the original data the LLM was trained on, not with the purposes the LLM was trained for. These models are trained to be a general model of a language that, for example, allows you to create text that looks good enough to be taken as human output. So the tool is not wrongly used with regard to the purpose that the companies aim to achieve, but it may not be really helpful with regard to the underlying data that was used to train it. And that data has specific characteristics. For example, in academia we know how academic discourse works, how papers are produced and what the relationship in a paper is between the introductory section, the methods section and so forth. If you understand how text is generated in academia, then you can see how that might influence the way the model operates in the end.

Some of the implications are really simple. There is much more text available on superficial stuff, because everyone working in a specific, very narrow field writes a roughly similar introduction in every paper: they are locating themselves in a field, for example political science. The writing in those introductory sections is very general, maybe even superficial – at least in comparison to the much deeper dive in the main parts of the paper – but everyone does it, and therefore the model produces lots of text on this general level. Every research question, on the other hand, is very specific and particular to that paper, and different from other papers. That means the data on the specific research question is actually rare, in contrast to the introductions, which at the same time are more superficial than the in-depth research the paper is actually about. So you end up with a large amount of text that is superficial and smaller amounts of text that relate to specific research questions in a specific field. If you understand this relationship between probabilities and text production in academia, then you know that you should try to prompt a system in a way that it doesn’t end up just in this introductory space – if you use a spatial metaphor for how the language is distributed in the ML model – or in a superficial Wikipedia-reproduction space, where the probability is really high and all you get is the same kind of superficial output. And if you understand how that works, you can make use of these characteristics of ML models and use proper prompting to generate texts that actually go into depth.
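
To make the prompting point concrete, here is a small, invented illustration in Python; the prompts themselves are assumptions, picking up the mental-models example from above, and could be pasted into any chat interface or sent through an API client:

```python
# Two prompts on the same topic, contrasting broad, introduction-level phrasing
# with a narrowly scoped one. The broad prompt tends to land in the
# high-probability "introductory" region of the model's output space; the
# narrow prompt names the disciplines, the distinction and the expected depth.
broad_prompt = "Tell me about mental models."

narrow_prompt = (
    "Which concepts in sociology describe collectively produced, shared "
    "interpretive frames, in contrast to individual mental models in "
    "psychology, and how do the key authors differentiate these concepts?"
)

for prompt in (broad_prompt, narrow_prompt):
    print(prompt, end="\n\n")
```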

Sascha Schönig: Most of the things you talked about are the positive aspects of AI, such as it being a time saver, or assisting with research or translations, or even facilitating access to texts that are not available in a language you understand. That’s also a finding of the paper and of the Delphi study which was conducted last year by you and a number of colleagues. There was also a statement in the results section declaring that the use of AI is like a double-edged sword. Now that some time has passed since last year, would you still say AI is that double-edged sword?

Jörg Pohle: Most of the negative characteristics that we identified, either in the literature or in the survey, are things that I haven’t encountered myself yet, at least as far as what we talked about before is concerned, such as results that might be completely made up. That’s all true; all of that might happen. Personally, I just haven’t encountered it in my own work. First, I haven’t yet seen a text that was generated and then submitted to, for example, a journal, where it was successfully peer reviewed and published. And regarding hallucinations, I just don’t use the system for learning something or for understanding. Thus, I don’t trust the output in the sense that it is correct – and I don’t need to trust the output. It is just probable from a language perspective: a probable next word is generated, and sentences are produced by generating sequences of probable next words. If you understand that and use the system accordingly, you don’t have a problem with made-up answers or hallucinations, because you’re using ChatGPT as a kind of search tool to search within some space of written text, basically a big-data search across all kinds of articles, which looks for recurring terms and then reproduces them in the form of proper sentences. That’s good to read, and if you don’t take it as truth or as a reproduction of other people’s research, but just as a probability distribution, it is good enough for what I want to do. On the other hand, I certainly agree that it’s a double-edged sword. But that’s actually true for many or all computer systems or computer-based systems: they have good and bad characteristics, and ambivalent implications. For example, why do we regulate the export of weapons and computer chips? Because they can be used both civilly and militarily.

Sascha Schönig: I see what you’re getting at. The applications are inherent in the technology and at the same time dependent on the user. I could take a spoon and try to scoop your heart out, whether that makes sense or not. But theoretically someone could do it.

Jörg Pohle: Exactly. And at the level of the design of the spoon, you can’t change anything about this possible use. I can’t ask the people producing the spoon to make it so that you can’t use it as a weapon, because then you might not be able to use it as a spoon anymore. That possibility is basically not in the spoon itself, but in the intentions, uses and practices of the users.

Sascha Schönig: Or you use AI with badly constructed prompts. I can imagine that the results then contain false information, in the sense of fake news and misinformation?

Jörg Pohle: Maybe the fear of the spread of misinformation is a child of the current Zeitgeist. I think there is a very broad presumption about disinformation or propaganda, even within academia or the broader scientific field, that pseudoscience is always also badly written. This presumption that everything that’s wrong is also badly written is widespread. Many people think that only good science, only truthful information, is properly communicated, and that all the badly communicated stuff is either wrong or misleading. If you have this presumption, then you will see ChatGPT as a really big threat, because it always produces good text, which means it becomes hard to distinguish correct from false information. But I’m not sure this presumption is actually correct. I think it is maybe quite self-serving for the people who hold it, and at least partially if not fully wrong. There is no quality difference with regard to the language or the text that is produced: ChatGPT presents the ‘bad’ information in the same more or less perfect language as it presents the ‘good’ information. So people think they can look at a well-written text and deduce from it that it must be true. Something like: a good formulation implies truthfulness. But that’s not true and has never been true. I think that might be the fear behind the scandalisation.

Sascha Schönig: What would be a strategy to counter those fears? Is it to teach academics how to properly use AI and large language models?

Jörg Pohle: Learning how to think about and construct your prompts along the logic described above is something you can do on the input side of academic text production with LLM-based systems. But it’s also about how you read texts from others. If you don’t know whether a text has been written by ChatGPT or by a researcher, then maybe you should be critical and shouldn’t deduce from the fact that it’s well written that it’s good or trustworthy. On the other hand, if you read a badly written text, it’s not automatically bad science; maybe it’s just badly written. There is something we found in the literature review for this paper: many of the mechanisms that have been developed to automatically detect AI-generated academic texts actually rely on characteristics of the language that are also typical of non-native speakers. That reproduces biases and discrimination on a totally different level, because now everyone who submits something as a non-native speaker is under suspicion of having used AI to generate their text, because it reads as if it were AI-generated. And it reads that way because most of the texts on which these models have been trained were actually written by non-native speakers. As they say, the lingua franca of academia is broken English – and if people produce texts that are used to train machines, the machines will reproduce something that has been created by non-native speakers, and that might then show in the generated text. So I think you should take care when you use the system to create output, whatever that is. To do research, you should understand how LLMs are trained. You should especially understand the texts that have been used to train the models, how those texts have been produced and how that shapes them, whether they are academic texts or not. And you should learn to cope with the fact that you cannot easily deduce the quality of the content from the quality of the language. That may be something you have been doing in the past, though it has always been questionable. You should be critical of the content, independent of the quality of the text.

Sascha Schönig: That’s a very important statement. In the discussion section of your paper there is a call for regulation, which is also a point our colleague Theresa Züger makes. Where should we regulate AI? At the level of specific use cases? Should researchers be schooled in how to use AI? What is your take on that?

Jörg Pohle: The discussion in the paper was mostly about what we learned from the survey, and most of the people in the survey said that AI should be regulated, or regulated more. There is regulation that primarily addresses those who develop and train systems like these, the companies that employ the developers – and these actors are not necessarily the same as those who use the systems. Systems are usually trained more than once. ChatGPT is trained at least twice: the first training produces the large language model itself, a model that is able to generate text – this is called domain training. Then it is retrained, which is called fine-tuning, to be able to act as a conversational agent, so that you can actually communicate with it in natural language: you prompt it and it gives answers. In order to have a good large language model, you need lots of text as training material. This makes it hard to guarantee the quality of the input because, as we already discussed, the quality of the language is not necessarily related to the quality of the content. So you might end up with a large collection of data that combines good-quality language with bad-quality content. That is actually good for the training, because it leads to a language model that is able to produce good-quality output, at least language-wise. But the price you pay comes from the bad-quality content: the LLM produces hallucinations (which is not really a proper ascription; it’s an erroneous psychologisation of a machine), or it produces some other bad-quality content. Unfortunately, that is by design. The main point is to get good-quality text out of the machine, and that’s what these systems deliver – and they’re good at it. I’m not really sure it makes sense to regulate that, because otherwise you don’t get good language models at all. Maybe it’s better to have better fine-tuning, so maybe regulate the fine-tuning. Or maybe regulate the fields of application that the models can be used in – for example, a model that has been trained on anything from the internet shouldn’t be used in medicine. Maybe a model that is really good at answering general questions on the level of, say, Wikipedia can be used as first-level support for people, maybe even without any regulation. But when people’s questions go deeper or have greater implications, the models put to use for them should be regulated more strictly. It is quite hard to see clearly where these tools are simply useful and where the limitations outweigh the promises. The boundary between the two, the good and the bad, is always rather blurry; it might even run within a sector, even within a single use case. And maybe there are issues that don’t require regulation but rather a kind of critical understanding that users should have when using these models or the systems that employ them. One thing is that it doesn’t really matter whether bad text or bad content was created by a human or by a machine: if it’s bad, it’s bad. And people should be able to detect that by reading a text or watching a video.

That requires education – and much more general education than just AI education or digital literacy. It’s about learning how to understand content and substance, and being able to distinguish the style of a text, a video or any other format, and the quality of the writing, narration or presentation, from the content that is actually conveyed.
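
As a rough illustration of the two training stages mentioned above (a first pass that produces the language model itself, then fine-tuning into a conversational agent), here is a minimal sketch using the Hugging Face transformers and datasets libraries; the base model, the dialogue file and the hyperparameters are placeholders, and this is of course not the actual ChatGPT training setup:

```python
# Minimal sketch of the two training stages: stage 1 is the pretrained
# ("domain-trained") language model, here simply loaded; stage 2 fine-tunes it
# on conversational data so it behaves as a dialogue agent rather than a plain
# next-word predictor. "gpt2" and "dialogues.jsonl" are illustrative
# placeholders; the file is assumed to contain records with a "text" field.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)  # stage 1: pretrained LM

# Stage 2: fine-tuning on (hypothetical) conversation transcripts.
dialogues = load_dataset("json", data_files="dialogues.jsonl")["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dialogues.map(tokenize, batched=True,
                          remove_columns=dialogues.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="chat-finetune",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```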

Sascha Schönig: That’s a wonderful last word, thank you very much for your time.
