Researchers at Stanford and Meta AI Develop a Dataset Pruning Technique for Scaling AI Training
In machine learning, test error decreases as the amount of training data used to build the model increases. These relationships are called neural scaling laws. Typically the test error falls off as a power law with the size of the training dataset. Because of this, millions of dollars are invested in collecting data. The problem with power-law scaling is that massive amounts of additional data are required to improve performance by only a few percentage points, which is unsustainable. In this paper, the researchers developed a metric for pruning the dataset so that the error instead decays exponentially with dataset size.
The researchers used statistical mechanics to show that performance can scale via an exponential decay relationship when datasets are appropriately pruned. Existing pruning techniques are either compute-intensive or perform poorly. The new approach uses a self-supervised AI model to estimate a pruning metric with far less computation. A paper published by OpenAI in 2020 showed that model performance follows a power law with respect to the number of parameters, the size of the training dataset, and the compute used. Under such a law, a substantial amount of additional training data is required to increase accuracy by a few percentage points. Under an exponential decay relationship, much less additional data is needed to achieve a similar improvement.
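A quick back-of-the-envelope calculation shows why this difference matters. The sketch below compares how much extra data is needed to halve the test error under power-law scaling (error ~ n^(-alpha)) versus exponential scaling (error ~ exp(-b·n)); the values of `alpha` and `b` are hypothetical, chosen only for illustration and not taken from the paper.

```python
import math

# Power law: error = c * n**(-alpha). Halving the error requires multiplying
# the dataset size n by a factor of 2**(1/alpha).
alpha = 0.1                            # hypothetical power-law exponent
power_law_factor = 2 ** (1 / alpha)    # multiplicative: 1024x more data

# Exponential decay: error = c * exp(-b * n). Halving the error requires only
# ln(2)/b *additional* examples, regardless of the current dataset size.
b = 1e-6                               # hypothetical decay rate
exp_additional = math.log(2) / b       # additive: a fixed number of examples

print(f"power law: {power_law_factor:.0f}x more data to halve the error")
print(f"exponential: {exp_additional:,.0f} additional examples to halve it")
```

With these illustrative numbers, the power law demands over a thousand times more data for the same improvement that exponential scaling achieves with a fixed additive increment.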
The team of researchers at Meta began by using statistical mechanics to develop a theoretical model of how data pruning improves performance. First, they determine the margin of each training example, defined as its distance from the decision boundary. The margin indicates whether an example is easy (large margin) or hard (small margin). Then comes the pruning of the dataset. The researchers determined that for small datasets it is best to keep the easy examples, while for larger datasets keeping the harder examples works better. They also found that as the initial dataset size increases, the amount of data that must be pruned to achieve exponential decay increases.
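The margin-based pruning rule above can be sketched in a few lines. This is an illustrative implementation, not the paper's code: `prune_by_margin` is a hypothetical helper that, given per-example margins, keeps the easiest (largest-margin) examples for a small dataset and the hardest (smallest-margin) examples for a large one.

```python
import numpy as np

def prune_by_margin(margins, keep_frac, small_dataset):
    """Return sorted indices of the examples to keep (illustrative sketch)."""
    n_keep = int(len(margins) * keep_frac)
    order = np.argsort(margins)          # ascending: smallest margin (hard) first
    if small_dataset:
        keep_idx = order[-n_keep:]       # largest margins = easy examples
    else:
        keep_idx = order[:n_keep]        # smallest margins = hard examples
    return np.sort(keep_idx)

margins = np.array([0.9, 0.1, 0.5, 0.3, 0.7])
print(prune_by_margin(margins, 0.4, small_dataset=True))   # keeps the two easiest
print(prune_by_margin(margins, 0.4, small_dataset=False))  # keeps the two hardest
```

The only moving part is which end of the sorted margins is retained; everything else is bookkeeping.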
Although large foundation models are trained on unlabeled datasets, the best existing pruning metrics require a large amount of compute and labeled data, making them infeasible for pruning foundation-model training sets. The Meta researchers created a self-supervised pruning metric to address this problem. The team applied k-means clustering in the embedding space of a pretrained model; each example's distance from its nearest cluster centroid serves as the pruning metric.
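The self-supervised metric can be sketched as follows. This is an illustrative toy version: random Gaussian vectors stand in for the embeddings a pretrained self-supervised model would produce, a minimal Lloyd's iteration stands in for a full k-means implementation, and the cluster count and keep fraction are hypothetical values.

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 16))   # stand-in for model embeddings

def kmeans(x, k, iters=20, seed=0):
    """Minimal Lloyd's k-means; returns (centroids, labels)."""
    r = np.random.default_rng(seed)
    centroids = x[r.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = np.linalg.norm(x[:, None] - centroids[None], axis=2).argmin(1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = x[labels == j].mean(0)
    return centroids, labels

centroids, labels = kmeans(embeddings, k=5)

# Pruning metric: each example's distance to its nearest cluster centroid.
# Small distance = prototypical ("easy"); large distance = atypical ("hard").
dists = np.linalg.norm(embeddings - centroids[labels], axis=1)

# For a large dataset, keep the hardest examples (largest distances).
keep_frac = 0.8
kept = np.argsort(dists)[-int(keep_frac * len(dists)):]
print(f"kept {len(kept)} of {len(dists)} examples")
```

Because both the embeddings and the clustering are computed without labels, the metric can be applied to the unlabeled datasets foundation models are trained on.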
This article is a research summary written by Marktechpost Staff based on the paper 'Beyond neural scaling laws: beating power law scaling via data pruning'. All credit for this research goes to the researchers on this project. Check out the paper and reference links.