Ilya Sutskever, a co-founder and former chief scientist of OpenAI, claims, “Data is the fossil fuel of AI, and we used it all!”
Is this aphorism true?
This claim is made in the context of explaining that the limitation for AI (LLMs in particular) lies in the quality of data required to mimic intelligence, a limitation I refer to here as the ‘entropy gap’.
Entropy, in information theory, is the measure of uncertainty or unpredictability in a probabilistic system. In the context of AI, entropy quantifies the variability and richness of information within a dataset, reflecting how evenly distributed or diverse the data points are across possible outcomes. This diversity contributes to the uncertainty by ensuring the dataset encompasses a broad range of patterns or features for the model to learn from.
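To make the definition concrete, here is a minimal sketch of Shannon entropy computed over the empirical distribution of a small sample. The function name `shannon_entropy` and the toy datasets are illustrative assumptions, not part of any particular library; the point is only that an evenly spread dataset scores higher than one dominated by a single outcome.

```python
import math
from collections import Counter

def shannon_entropy(samples):
    """Shannon entropy, in bits, of the empirical distribution of `samples`."""
    counts = Counter(samples)
    total = len(samples)
    # H = -sum(p * log2(p)) over the observed outcomes
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Illustrative toy datasets (assumptions for this sketch)
uniform = ["a", "b", "c", "d"] * 25        # four outcomes, equally likely
skewed = ["a"] * 97 + ["b", "c", "d"]      # one outcome dominates

print(shannon_entropy(uniform))  # evenly distributed: higher entropy
print(shannon_entropy(skewed))   # concentrated: lower entropy
```

With four equally likely outcomes the entropy reaches its maximum of 2 bits; the skewed sample, although it contains the same four outcomes, carries far less.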
I define the ‘entropy gap’ as the difference between the variability and richness of patterns present in the training data and the variability required to mimic the complexity of human intelligence or real-world scenarios.
In AI, this gap highlights a mismatch between the diversity and uncertainty present in the training data and the broader, more unpredictable diversity the model encounters when deployed in real-world environments.
The more significant this entropy gap, the less capable the model becomes in generalizing to unseen data, adapting to novel conditions, or achieving meaningful performance across diverse tasks.
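One possible way to put a number on this idea, offered only as a rough sketch, is to compare the entropy of what a model was trained on with the entropy of what it meets in deployment. The `entropy_gap` function and the document categories below are hypothetical; the essay's definition is qualitative, and a plain entropy difference is just one simple proxy for it.

```python
import math
from collections import Counter

def entropy_bits(samples):
    """Shannon entropy, in bits, of the empirical distribution of `samples`."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_gap(train_samples, deploy_samples):
    """Hypothetical proxy: deployment entropy minus training entropy, in bits."""
    return entropy_bits(deploy_samples) - entropy_bits(train_samples)

# Illustrative data: training is dominated by one category,
# while deployment sees a broader mix, including unseen categories.
train = ["invoice"] * 90 + ["receipt"] * 10
deploy = ["invoice"] * 40 + ["receipt"] * 30 + ["contract"] * 20 + ["memo"] * 10

gap = entropy_gap(train, deploy)
print(f"entropy gap: {gap:.2f} bits")
```

A positive value indicates that deployment is more varied than training. This proxy ignores which outcomes differ, so two datasets with equal entropy but different content would show no gap; a divergence measure would be needed to capture that.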
Bridging this gap requires not just more data but better-quality, contextually rich, and diverse datasets that reflect the complexity of the tasks AI is designed to perform.
With this perspective in mind, saying ‘Data is the fossil fuel of AI, and we used it all!’ amounts to equating high-quality, contextually rich, and diverse datasets with a finite resource, akin to fossil fuels.
This, however, is not true. Unlike fossil fuels, which are universally finite, the scarcity of quality data is highly context-dependent. In certain domains, such as rare disease research or specialized industrial applications, the availability of task-specific, quality data may be perceived as limited.
Yet, this scarcity can often be addressed through methods like synthetic data generation, data augmentation, or transfer learning, which allow for the refinement or expansion of available data.
It has to be acknowledged that these methods are not universally applicable. For instance, synthetic data may not fully capture the nuanced characteristics of real-world scenarios, and transfer learning can struggle to generalize across significantly different domains. Additionally, neither method adequately resolves issues of bias or fills gaps in highly specialized or ethically sensitive datasets, where precise and context-specific information is critical. These limitations underscore the importance of careful data curation and domain expertise in overcoming quality data scarcity.
Regardless, human-generated data is an inherently renewable resource for AI: it is continuously generated by human activity, technology, and the environment, whereas fossil fuels are finite and non-renewable in all circumstances.
The usefulness of this renewability, however, depends on significant effort in preprocessing, curation, and ensuring domain-specific relevance. Raw data alone is often insufficient for AI, and recent research indicates that synthetic data alone cannot serve as a replacement.
Therefore, the real data challenge in AI is not depletion but the scarcity of useful, quality data for specific tasks. This scarcity creates bottlenecks that resemble the effects of depletion but are not the same thing.
Furthermore, what qualifies as ‘useful’ is highly task-specific and varies across domains, as it depends entirely on the context and goals of the AI system. Unlike the absolute scarcity of fossil fuels, the scarcity of useful data is a relative concept, shaped by the requirements of a particular application and the ability to preprocess, curate, or generate task-relevant datasets.
A more apt aphorism might be this: data is the “drinking water” of AI.
Not all data is immediately useful, just as not all water is potable. Raw data, like raw water, must undergo a process of purification to become valuable for AI systems. This purification involves data cleaning to remove noise and errors, labeling to add structure and meaning, and augmentation to enhance diversity and applicability. Only after these steps can data meet the specific quality and relevance standards required for AI applications, much like how water must be treated to become safe and effective for human consumption. This analogy underscores the importance of preparation and refinement in transforming raw data into a resource that fuels AI development.
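The three purification steps can be sketched as a toy pipeline. Everything here is an illustrative assumption: the function names, the keyword-based labeling rule (a stand-in for real annotation), and the word-shuffling augmentation (a deliberately crude technique chosen for brevity).

```python
import random

def clean(records):
    """Drop empty entries and exact duplicates, preserving order."""
    seen, out = set(), []
    for text in records:
        text = text.strip()
        if text and text not in seen:
            seen.add(text)
            out.append(text)
    return out

def label(records, keyword="refund"):
    """Attach a toy label; a stand-in for human or model-assisted annotation."""
    return [(t, "billing" if keyword in t.lower() else "other") for t in records]

def augment(labeled, rng):
    """Add one variant per record by shuffling its word order."""
    out = list(labeled)
    for text, tag in labeled:
        words = text.split()
        rng.shuffle(words)
        out.append((" ".join(words), tag))
    return out

# Illustrative raw input with noise: whitespace, an empty entry, a duplicate
raw = ["  I want a refund  ", "", "I want a refund", "Where is my order?"]
rng = random.Random(0)  # seeded so the run is repeatable
dataset = augment(label(clean(raw)), rng)
print(dataset)
```

Four raw entries shrink to two after cleaning, then grow to four labeled examples after augmentation. In practice each stage is far more demanding, which is the point of the analogy: the cost lies in the treatment, not in the supply of raw material.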
The real challenge in AI lies not in the renewability of data, which is continuously generated by design, but in transforming this data into useful, quality datasets to address scarcity. This process should include grappling with critical challenges such as identifying and mitigating biases, ensuring fairness, and navigating ethical considerations. Contextual specificity also plays a crucial role, as data that is relevant and useful in one domain may not be suitable for another.
These complexities highlight the need for thoughtful curation, rigorous validation, and adherence to ethical principles in transforming raw data into a reliable foundation for AI systems.
When we say “Data is the fossil fuel of AI, and we used it all!”, we make two mistakes. The first is underestimating or forgetting that AI does depend on natural resources that are truly fossil and therefore non-renewable.
The second, arguably even more serious, is rendering invisible what creates the very conditions for the existence of the data AI needs to be trained on: humans.
Data does not exist independently of human actions, decisions, or systems. Whether it’s generated through explicit activities (e.g., social media posts) or implicitly (e.g., sensor data), humans are directly or indirectly responsible for creating the conditions for data generation. Since data originates from human activity, its existence and utility are contingent upon human inputs, creativity, and labor.
As long as humans exist, so will data, including quality data.
Data is not the ‘fossil fuel’ of AI; natural resources are, and hopefully human life won’t be.
By definition, AI has not drunk all the data and may never do so. More importantly, it should not, because the natural resources it depends on set a sustainability limit we must respect if we are to put Artificial Integrity over any Intelligence.