Article image created using StableDiffusion 2.0
One of the most active areas of AI currently is the field of Large Language Models (LLMs). There is also plenty of controversy regarding these models and their use: witness the recent headlines about Google's LaMDA being believed to be sentient, Meta's Galactica model designed to answer scientific queries, and the class-action lawsuit brought against Github's Co-pilot service for AI code writing, to mention just a few examples. I will take a look at the IP-related aspects of this below, but first a little primer explaining the engineering facts behind LLMs.
LLMs are ginormous machine learning models with many billions of parameters. They are simply huge. There are many different flavours of such models, but they all have in common that, in essence, they learn probability distributions over sequences of words or, in other words, statistics about the structure of language. This can then be used for many different tasks, such as querying large bodies of text using a prompt the model completes, generating advertising copy or software instructions based on a prompt, classifying text and many more.
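As a toy illustration of what "a probability distribution over sequences of words" means (a deliberately tiny sketch, nothing like a real LLM), one can estimate next-word probabilities from bigram counts over a small corpus:

```python
from collections import Counter, defaultdict

# Toy illustration only: estimate the probability of the next word from
# bigram counts over a tiny corpus. Real LLMs learn far richer
# distributions with billions of parameters, but the principle is the
# same: model which words are likely to follow a given context.
corpus = "the cat sat on the mat and the cat slept".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word_distribution(word):
    """Return P(next word | word) as a dict, estimated from the corpus."""
    total = sum(counts[word].values())
    return {w: c / total for w, c in counts[word].items()}

print(next_word_distribution("the"))  # 'cat' is twice as likely as 'mat' after 'the'
```

A real LLM replaces the bigram counts with a neural network conditioned on a long context, but the output is still, at bottom, a distribution over likely next words.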
To understand a bit more about what LLMs are, it is best to understand how they are trained to acquire these distributions over text sequences. This happens in many different ways, but with more commonalities than differences, so it is useful to look at how one particular model is trained as an illustration. Take BERT, a model developed by Google based on the Transformer architecture. BERT is trained by feeding it text from the entire Wikipedia corpus and getting the model to perform two tasks: filling in words that have been masked in the input and determining whether two sentences correctly follow each other or are out of sequence. Details of how this works are in the linked explanation, but in a nutshell, a training signal is generated that measures how well the model does at filling in the masked words and at rating whether pairs of sentences follow each other correctly. This training signal is then used to adjust the parameters of the model until the model gets very good at these tasks.
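The masked-word task can be sketched in a few lines. This is a simplified illustration only: real BERT pre-processing works on subword tokens and uses a more elaborate masking scheme (a selected token is sometimes kept as-is or replaced with a random token rather than masked):

```python
import random

# Sketch of how masked-language-model training examples are made
# (BERT-style, simplified). A fraction of tokens is replaced with a
# [MASK] token; the model's task is to predict the originals, and the
# prediction error provides the training signal.
def mask_tokens(tokens, mask_prob=0.15, seed=1):
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets.append(tok)      # the model must recover this token
        else:
            masked.append(tok)
            targets.append(None)     # no loss computed at this position
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens)
print(masked)
```

During training, the model's predictions at the `[MASK]` positions are compared against the stored targets, and the mismatch is what drives the parameter updates.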
In and of themselves, the basic tasks used to adjust the very large number of parameters tend not to be very interesting, as in the case of BERT. The interesting bit is that, by adjusting the parameters of the model to do these narrow tasks well, the model distils information about the structure of language into its parameters. Even without further training, this enables LLMs to complete an initial text, known as a prompt, in interesting ways (zero-shot learning). In addition, these models can subsequently be fine-tuned for more specialised tasks by adapting the model parameters further with a training signal that is designed for the specialised task, for example identifying emotions in a chatbot exchange or completing software code prompts. Even more sophisticated methods have recently added an element of reinforcement learning (see our technical primer), for example the recent ChatGPT model.
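The mechanics of fine-tuning can be illustrated with a deliberately minimal sketch: a small task-specific head (here plain logistic regression on random stand-in "embeddings") has its parameters nudged by gradient steps on a task loss. Real fine-tuning updates vastly more parameters, but the mechanism is the same:

```python
import numpy as np

# Minimal fine-tuning sketch: a tiny task head on top of frozen
# "embeddings" (random stand-ins here) is adjusted with a task-specific
# loss, e.g. for a binary emotion label. The mechanism — nudge
# parameters to reduce a task loss — mirrors real fine-tuning.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))              # pretend sentence embeddings
y = np.array([0, 1, 0, 1, 1, 0, 1, 0])   # pretend task labels

w = np.zeros(4)                           # task-head parameters
lr = 0.1
losses = []
for _ in range(20):
    p = 1 / (1 + np.exp(-X @ w))          # predicted probability of label 1
    losses.append(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
    w -= lr * X.T @ (p - y) / len(y)      # gradient step on the task loss

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Each pass lowers the task loss a little; after enough passes the head has specialised to the task while the pretend embeddings (the "pre-trained" part) stay fixed.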
But the really important take-away from all this is that, fundamentally, "all" that LLMs learn is to produce an output that is likely to go together with the input under the distribution of text the LLM has learnt from. When an LLM generates an output that has the appearance of the LLM understanding its input, it is in fact at best a simulation of understanding, based on producing statistically likely outputs. When Google's LaMDA convinces a software engineer it is sentient or Deepmind's AlphaCode puts in a respectable performance at coding competitions, it is tempting to see true intelligence at play. Understanding how these models work helps us remember that these and other models do not understand what they are doing in the way humans do, by applying reasoning and logic to facts. But you may wonder: why is a patent attorney going on about all this?
Well, there are a number of reasons. On the one hand, I am simply fascinated by what can be achieved by these models, but also by the extent to which they are misunderstood. Understanding where these models come from helps in understanding their limitations, of which there are many, from reflecting the prejudices and biases in their training data to simply making stuff up that sounds authoritative but is plainly wrong. The recent controversy surrounding Meta's Galactica model, which was released with claims of providing scientific understanding and turned out to be deeply flawed when presented with "out of distribution" prompts, is a good illustration of this. For an intuitive understanding of how these things can go wrong, take a look at the as-ever enlightening and amusing AI Weirdness post on this.
On the other hand, there are some super interesting IP aspects to grapple with in the context of LLMs and their uses. On the patenting side, while the EPO considers linguistics an abstract theory, which can make it difficult to patent inventions in the field of language processing, any improvements to the underlying machine learning technology that are motivated by the way computers work are in principle patentable at the EPO. One aspect of such improvements, in particular in connection with the transformer architecture used by most LLMs, is the parallelisation of the machine learning algorithms, that is, arranging the algorithms so that operations can be performed in parallel to speed up processing. This is a trend underlying many of the advances that enable models with billions of parameters to be trained. The EPO recognises parallelising computation as one of those aspects of algorithm design that are motivated by how computers function and hence capable of contributing to the all-important technical effect (see my previous piece here), so this is a potentially rich field for patenting activity.
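To see why the transformer parallelises so well, note that self-attention over a whole sequence reduces to a few matrix products, so every position is processed at once; a recurrent model, by contrast, must step through the sequence one position at a time. A toy sketch (hypothetical sizes, single unscaled attention head, no learned projections):

```python
import numpy as np

# Toy single-head self-attention: the whole sequence is handled by a
# handful of matrix products, which hardware can execute in parallel
# across positions. (Illustrative sizes and random values only.)
rng = np.random.default_rng(0)
seq_len, d = 6, 4
Q = rng.normal(size=(seq_len, d))   # queries, one row per position
K = rng.normal(size=(seq_len, d))   # keys
V = rng.normal(size=(seq_len, d))   # values

scores = Q @ K.T / np.sqrt(d)       # all position pairs scored at once
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
out = weights @ V                   # all positions attended to at once

# By contrast, a recurrent update h_t = f(h_{t-1}, x_t) forces seq_len
# sequential steps: step t cannot begin before step t-1 has finished.
print(out.shape)  # (6, 4) — one output row per position, computed together
```

It is exactly this matrix-product structure, rather than a chain of dependent steps, that maps well onto parallel hardware and makes training billion-parameter models feasible.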
And outside of patents, the wider IP implications are even more interesting. The training of LLMs involves access to copyright material, for example the content of (English) Wikipedia in the case of BERT. Such access may be licensed, for example under the Creative Commons licence in the case of Wikipedia. Licences like Creative Commons require attribution when content is re-distributed, a condition that may not apply to copying text from Wikipedia to generate training data that is itself never distributed. For other content used for training LLMs, in Europe we have a copyright exception for text and data mining that would cover this sort of thing unless prohibited by a right-holder with appropriate means, and in the US, copying for the purpose of generating training data may be exempt from copyright as fair use (or may not be, as this point seems not to have been tested by the courts – see below). Then there is the question as to who owns the copyright in content generated by LLMs in response to input prompts. This is currently a hot topic of debate and something I will look at again in the future.
How copyright content used to train LLMs is treated in the US may soon be clarified if a recently filed class-action suit goes ahead. The class action concerns Github's Co-pilot, a service in which an LLM generates code suggestions for you as you type your own code. Co-pilot is based on OpenAI's Codex model and has been trained on the corpus of open-source software uploaded to Github. While open-source software comes with copyright licences that make it free to use, these licences come with conditions attached. Most importantly, open-source licences typically require attribution to the author and that a licence notice remains attached to the software. See the reports of the lawsuit here and here. Many points of law are raised, including whether copying copyright material for the purpose of training a model is fair use or not, but from the perspective of this piece the most interesting point could be whether it is a copyright infringement when an LLM outputs a significant portion of a copyright work in response to a prompt.
As it happens, when Co-pilot suggests code to its users, it seems that it sometimes outputs verbatim portions of the copyright works it ingested during training. On the assumption that copying such portions would be copyright infringement, does the process of training an LLM on a large corpus of documents, followed by generating an output that corresponds to a portion of a document in the corpus, represent copying and hence copyright infringement? The case would be clear-cut if LLMs worked by accessing a document in a database, extracting the portion in question and reproducing it. But, as explained above, this is not the case. Rather, LLMs produce an output that is likely (under the distribution of the training data) to go with an input prompt. The suit alleges that some outputs contain tell-tale signs of copying, for example the same errors as the source material, and therefore appear to be copied. However, there is no direct access to works and no copying of portions of them per se: outputs are generated as likely sequences of words given the input prompt and the parameters of the model. Does this amount to indirect copying? Or is it simply the case that the code portion in question is the most likely expression of the idea corresponding to the prompt provided as an input? And if so, is this still copying, even if it can be established that the model has seen the original code portion in its training data?
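A toy model makes the point concrete: train a bigram model on a single code-like snippet (chosen here so every token is unique) and greedy decoding simply replays the snippet verbatim. There is no database lookup, only statistics, yet the training text comes back word for word. Real LLMs are vastly more complex, but verbatim reproduction of rare training sequences can arise in a loosely analogous way:

```python
from collections import Counter, defaultdict

# Toy illustration: a bigram model trained on one code-like snippet.
# Greedy decoding (always pick the most likely next token) replays the
# snippet verbatim — no stored copy is consulted, only statistics.
training_text = "def add_numbers ( first , second ) : return total".split()

counts = defaultdict(Counter)
for prev, nxt in zip(training_text, training_text[1:]):
    counts[prev][nxt] += 1

def greedy_complete(prompt, n=10):
    """Extend the prompt by repeatedly taking the most likely next token."""
    tokens = prompt.split()
    for _ in range(n):
        candidates = counts[tokens[-1]].most_common(1)
        if not candidates:
            break                      # no continuation seen in training
        tokens.append(candidates[0][0])
    return " ".join(tokens)

print(greedy_complete("def"))  # replays the training snippet verbatim
```

Whether this kind of statistically mediated reproduction counts as "copying" in the legal sense is, of course, exactly the question the Co-pilot suit raises.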
A lot of copyright cases turn on the evidence of copying having occurred as a question of fact. Here the mechanism by which the alleged copy is generated is presumably known, and so it seems that the question that will have to be answered is whether that mechanism of generating textual output with LLMs amounts to copying or not. Besides the legal interest in how concepts like “copying”, “idea” and “expression” are applied in the context of LLMs, the Co-pilot case could have profound consequences for how LLMs can be trained (at least in the US) and I will be sure to follow and report on its progress in this newsletter.