Hamburg Regional Court Rules on the Creation of a Training Data Set
An article by Lea Sophie Singson, Legal Data Steward at FAIRagro, Lea-Sophie.Singson@fiz-Karlsruhe.de
Promoting innovation without legal uncertainties – that was the declared aim of the exemption for text and data mining (TDM) in copyright law in 2018. A lot has happened in the past six years: The possibilities of ‘artificial intelligence’ (AI) technologies have steadily increased, yet numerous legal issues remain unresolved.
In the ‘LAION’ case, the Hamburg Regional Court was the first German court to address the question of whether the TDM exception also applies to training data sets for AI programs. The background to the proceedings and the judgement handed down at the end of September are explained and assessed below.
The court left two important points open, however. It did not rule on whether the training of generative AI generally falls under the TDM exception. Nor did it make a final decision on what an opt-out would have to look like in order to be considered ‘machine-readable’.
Creator versus Non-Profit Organization
The proceedings began with a complaint by a photographer. He had published photographs he had taken himself on a stock photo site. In principle, anyone could download the images from this site for a fee. In addition, watermarked versions of the photographs could be viewed publicly.
On the other side is LAION e.V., a non-profit organization that, by its own account, trains large-scale machine learning models. These are AI models that ‘learn’ on the basis of very large amounts of data, i.e. recognize patterns and then use them to make predictions or decisions. LAION wants to make these models, together with data sets, available to the general public free of charge and openly licensed under the CC BY 4.0 license.
Not All Training Is the Same: The Actions in Dispute
The main point of contention centered around the following question: Is it permissible TDM if a data set with text-image pairs is created that is subsequently used to train AI? Or does this constitute an unlawful reproduction of works under copyright law?
The case concerned a data set containing hyperlinks to images with corresponding descriptive texts. LAION wanted to make this available to the general public free of charge for the training of generative AI and licensed it under CC-BY 4.0. For the dataset, publicly available images on the internet – including a photograph created by the plaintiff – were extracted together with their URLs and the textual description of the image content and checked for consistency using software. If the image and image description matched, the pairs were included in the data set.
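The filtering step described above can be pictured with a minimal Python sketch. It is an illustration only and not LAION's actual software: the names `Candidate`, `build_dataset` and `toy_score`, the threshold value and the `DETECTED_LABELS` lookup table are all assumptions made for this example. In a real pipeline, the score would come from software that analyzes the image itself and compares it with the description.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Candidate:
    """One text-image pair: a link to an image plus its description."""
    url: str
    caption: str


def build_dataset(
    candidates: Iterable[Candidate],
    score: Callable[[Candidate], float],
    threshold: float = 0.5,  # hypothetical cut-off, chosen for illustration
) -> List[Candidate]:
    """Keep only the pairs whose image and description match well enough.

    Only the link and the text are stored, not the image itself.
    """
    return [c for c in candidates if score(c) >= threshold]


# Hypothetical stand-in for image analysis: labels a model might detect.
DETECTED_LABELS = {
    "https://example.org/cat.jpg": {"cat", "sofa"},
    "https://example.org/dog.jpg": {"dog"},
}


def toy_score(candidate: Candidate) -> float:
    """Share of detected labels that also appear in the description."""
    labels = DETECTED_LABELS.get(candidate.url, set())
    if not labels:
        return 0.0
    words = set(candidate.caption.lower().split())
    return len(labels & words) / len(labels)


pairs = [
    Candidate("https://example.org/cat.jpg", "a cat on a sofa"),
    Candidate("https://example.org/dog.jpg", "sunset over the sea"),
]
dataset = build_dataset(pairs, toy_score)
```

In this toy run, only the first pair ends up in `dataset`, because its description matches what the stand-in ‘analysis’ found in the image.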
The website featuring the plaintiff’s photo expressly prohibited the automated download of the works. This was another reason why the photographer challenged the legality of LAION’s actions.
Regional Court Decision: In Favor of Text and Data Mining
So much for the context – what exactly did the Hamburg Regional Court decide? The three most important findings are as follows:
1 – It is permissible TDM to create a dataset intended to be used for AI training.
2 – The intention to make a dataset available in open access is sufficient for TDM to be considered for the purpose of scientific research.
3 – No decision, but judicial opinion: a reservation of use against TDM by automated programs is sufficient even if it is declared in natural language.
What do these three decisions mean in detail?
1 – Creation of the Data Set is Text and Data Mining: With Support from EU Law
The court relies on the following understanding: ‘Text and data mining’ (TDM) is the software-supported analysis of large amounts of data. The results obtained by means of TDM provide information about patterns, trends and correlations in the data material.
According to the decision of the Hamburg Regional Court, the creation of the data set by LAION fulfils these requirements. This is because it was precisely the aim of the data set to enable the analysis of correlations between the pairs (match or no-match).
The court also referred to the EU AI Regulation (the so-called ‘AI Act’), which entered into force in August 2024. It can be inferred from this regulation that the European legislator also applies the TDM exception to the creation of data sets for the training of AI.
The court also makes it clear that the creation of the dataset is considered permissible TDM, irrespective of whether the subsequent AI training with this data set is itself covered. It justifies its decision on this point by stating that the admissibility of the creation of the data set cannot be made dependent on whether future technologies may use the data set outside the limits of legally permitted TDM to train AI. This is fundamentally a sensible approach. Otherwise, there would be no legal certainty for any form of permissible TDM.
2 – Open Access is Sufficient as a Research Purpose
Offering the dataset to the general public free of charge was sufficient for the court to assume TDM for the purpose of scientific research in accordance with Section 60d of the German Copyright Act, as the creation of the dataset can contribute to a gain in knowledge.
The court also clarified that LAION can carry out TDM for the purpose of scientific research regardless of whether the organization conducts scientific research in general. According to the court, the data set may also be offered to commercial users without falling outside the scope of the exception.
If the decision of the Hamburg Regional Court stands, it would significantly strengthen the position of those who make data sets for TDM purposes available free of charge or in open access.
3 – In the Opinion of the Court: Effectively Declared Opt-Out
Even though it had no influence on the outcome, the court’s decision dealt with the opt-out declared on the website. The opt-out, or ‘reservation of use’, is provided for in Section 44b (3) of the German Copyright Act. It means that a rights holder can prohibit the use of their works for TDM, i.e. reserve this use for themselves.
The legal regulation raises the question: How must the reservation of use be structured? Is it sufficient for a machine to be able to read a reservation of use? Or must it also be able to understand it? Or to put it more narrowly: Is a reservation declared in ‘natural language’ sufficient or must it be in a technical format (such as robots.txt)?
The Hamburg Regional Court took a stance here – at least in part: it indicated that a reservation of use written in natural language is sufficient for machine readability. The responsibility here lies with AI developers: They would have to use state-of-the-art technologies (which they are obliged to do under the AI Act). Since modern AI applications can already understand and process natural language, these programs could, in the opinion of the court, also recognize and take into account a reservation of use written in natural language.
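The difference between the two kinds of reservation can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions and not a legal test: the robots.txt content, the bot name `ExampleBot`, the sample terms of use and the keyword list are all invented for this example. The first function uses Python's standard `urllib.robotparser` module to evaluate a technical format. The second uses crude keyword matching as a stand-in for the language understanding that, in the court's view, modern AI applications already have.

```python
from urllib.robotparser import RobotFileParser


def robots_allows(robots_txt: str, user_agent: str, url: str) -> bool:
    """Evaluate a reservation expressed in the robots.txt format."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical keyword list; a real system would need far more than this.
RESERVATION_HINTS = ("automated", "scraping", "text and data mining", "crawler")


def mentions_reservation(terms_of_use: str) -> bool:
    """Very rough check whether terms of use contain a reservation."""
    text = terms_of_use.lower()
    return any(hint in text for hint in RESERVATION_HINTS)


# Invented sample inputs for illustration.
sample_robots = "User-agent: *\nDisallow: /images/\n"
sample_terms = "You may not use automated programs to download content from this site."

blocked = not robots_allows(sample_robots, "ExampleBot", "https://example.org/images/1.jpg")
open_page = robots_allows(sample_robots, "ExampleBot", "https://example.org/about.html")
reserved = mentions_reservation(sample_terms)
```

The technical format gives a clear yes or no per URL, whereas a reservation in natural language first has to be found and interpreted. That is exactly where the court sees the responsibility of AI developers.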
These Relevant Questions Remain Unanswered
With its judgement, the Hamburg Regional Court ruled on a number of pressing questions relating to copyright and artificial intelligence. Other questions, however, remained unanswered: For example, it remains for future courts to decide whether the training of AI per se falls under the TDM exception and what requirements are actually to be placed on an opt-out (see above).
Why the Fundamental Question of AI Training Remains Open in the Decision
In civil procedural law, the so-called disposition maxim applies. It gives the parties – i.e. plaintiff and defendant – the right to determine the initiation, subject matter and premature termination of the proceedings themselves. Section 308 (1) of the Code of Civil Procedure also expresses this principle. According to this section, the court may only decide on what has been applied for and not beyond the application. Therefore, if the parties dispute the legality of the creation of the data set, but not the subsequent training of an AI with this data set, the court is also barred from ruling on this.
Despite the restrictions, the decision of the Hamburg Regional Court definitely sheds some light on a highly contested field in which there currently seems to be a lot of movement.
It is to be expected that the case will also be taken to higher courts. Due to the strong reference to EU law, there may even be hope of a decision by the European Court of Justice (ECJ).
This article is licensed under CC-BY 4.0.