Logical Reasoning Questions for CLAT | QB Set 39

AI needs cultural policies, not just regulation

The future of Artificial Intelligence (AI) will not be secured by regulation alone. To ensure safe and trustworthy AI for all, we must balance regulation with policies that promote high-quality data as a public good. This approach is crucial for fostering transparency, creating a level playing field, and building public trust. Only by giving fair and wide access to data can we realise AI’s full potential and distribute its benefits equitably.

Data are the lifeblood of AI. In this regard, the laws of neural scaling are simple: the more, the better. The greater the volume and diversity of human-generated text available for unsupervised learning, the better the performance of Large Language Models (LLMs). Alongside computing power and algorithmic innovations, data acquisition is among the most important drivers of progress in the field.

A data race at the expense of ethics

But there is a problem. Humanity does not produce enough new digital content to feed these ever-growing beasts, and the pool of pristine training data is not expanding at the pace models demand. Llama 3, for example, was trained on some 15 trillion tokens, and models of this scale are voraciously approaching the limits of available new data; there is a point at which we may simply run out of pristine training text. A model’s ability to improve and produce reliable results is directly linked to the quality and diversity of its data.
Primary sources are missing. Most LLMs are not trained on the huge wealth of neglected archives and cultural data available in libraries, museums, and public repositories. Regulation alone cannot address the relentless recycling of secondary sources and out-of-date knowledge that limits the innovation capacity of AI.

The absence of primary sources

The notion that LLMs are trained on a universal compendium of human knowledge is a fanciful delusion. Current LLMs are far from the universal library envisioned by the likes of Leibniz and Borges. While stashes of pirated books such as “Books3” may include some scholarly works, these are largely secondary sources written in English; commentators estimate they barely scratch the surface of human culture. Conspicuously absent are the primary sources that mirror myriad communities and their own documents, traditions, stories, and identities.

These documents represent an untapped reservoir of linguistic and cultural data. They have the potential to help address the current bias and narrowness in AI systems and could serve as the backbone of future language models. Scholars and archivists have demonstrated how the digitisation of such sources can make AI a more equitable and valuable resource. Digitising and integrating this data into training sets would not only help correct existing imbalances but would also help preserve much of the world’s cultural heritage that is otherwise at risk from neglect, war, and climate change.

These sources also promise significant economic benefits. As well as helping neural networks scale up, their release into the public domain would mean that smaller companies, startups, and the open-source AI community could use these pools of free and transparent data to develop new applications, levelling the playing field against Big Tech while fostering innovation on a global scale.

Examples from Italy and Canada

Advances in the digital humanities, notably in Italy and Canada, have drastically reduced the cost of digitisation, making it possible to extract text from manuscripts at scale for AI training. Italy’s rich and ancient literary archive is now being digitised for AI: the “Digital Library” project has been allocated some €500 million in public funding as part of the “Next Generation EU” programme. Such initiatives are not yet the norm, but Italy’s work on its rich multilingual archives is being emulated in Canada, with similar results for Indigenous and minority languages.
Efforts in Spain and other countries are also beginning to address the gaps: Digital Humanities projects have digitised the records of the Spanish Cortes and other key regional documents, making them available for AI and digital scholarship. Language technologies must continue to move toward models that represent the specificity of their source cultures, ensuring more robust and innovative AI.

Question 1) Which of the following best captures the main argument of the passage?

A. AI progress should be regulated to ensure ethical use of data.
B. AI development relies solely on algorithmic innovations and computing power.
C. The digitization of cultural heritage can enhance AI and promote equitable access to data.
D. Current LLMs are trained on a comprehensive database of global knowledge.

Question 2) According to the passage, what is the primary issue with current AI training datasets?

A. They are too small to be effective.
B. They rely too much on human intervention.
C. They lack diversity and can amplify biases.
D. They contain too many primary sources.

Question 3) What does the passage suggest about the future availability of high-quality training data?

A. There will be an abundance of high-quality data available.
B. We may reach a point where there is not enough pristine text for AI training.
C. The amount of high-quality data will remain constant.
D. The quality of data will continue to improve without intervention.

Question 4) What example does the passage provide to illustrate the potential benefits of digitizing cultural heritage?

A. The use of AI in autonomous vehicles.
B. The Digital Library project in Italy.
C. The development of social media platforms.
D. The advancement of AI in medical research.

Question 5) Which of the following statements is NOT supported by the passage?

A. Primary sources are largely missing from current LLM training datasets.
B. Regulation alone is sufficient to ensure safe AI development.
C. Running out of pristine training text is a concern.
D. The digitization of low-resource languages offers significant benefits.

Question 6) Based on the passage, what role do primary sources play in the context of AI training?

A. They are the main focus of current AI training datasets.
B. They are mostly ignored but represent a valuable reservoir of data.
C. They have been fully digitized and integrated into AI systems.
D. They are considered less important than secondary sources.


CLAT Buddy