NVIDIA has announced Nemotron-CC, a very large English-language dataset for training AI models. The dataset contains 6.3 trillion tokens in total, of which 1.9 trillion are synthetic data. NVIDIA describes it as one of the most comprehensive resources developed to date for training large language models (LLMs). The company says the release will make a significant difference, especially for academic and commercial use. Here are the details…
NVIDIA introduces Nemotron-CC, an AI training dataset with 6.3 trillion tokens
A large amount of data from Common Crawl was reportedly used to build Nemotron-CC. This data went through rigorous processing and filtering to create a high-quality subset, Nemotron-CC-HQ. NVIDIA describes the dataset as “an ideal training material for large language models.”
The release is expected to address the scale and quality limitations of existing training datasets. NVIDIA says it outperforms leading open-source datasets such as DataComp-LM (DCLM), and that models trained with Nemotron-CC showed notable improvements on various benchmarks. For example:
- Scores on the MMLU (Massive Multitask Language Understanding) benchmark rose by 5.6 points compared to existing systems.
- Models with 8 billion parameters improved by 5 points on MMLU and 3.1 points on ARC-Challenge.
- Nemotron-CC was reported to deliver an average gain of 0.5 points across 10 different tasks compared to other high-quality datasets.
These results show the impact Nemotron-CC can have on the training and capabilities of large language models. NVIDIA also said that it used techniques such as model-based classifiers and synthetic data rephrasing in developing Nemotron-CC, in order to increase the diversity and quality of the data. In addition, the number of high-quality tokens was increased by relaxing the strict rules used in traditional data filtering methods.
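To make the idea of classifier-based filtering with relaxed rules more concrete, here is a minimal toy sketch. It is not NVIDIA's pipeline: the two scorers are hypothetical word-count heuristics standing in for real model-based classifiers, and combining them by taking the maximum score is an assumption made purely for illustration. The point is only that when a document needs to pass just one of several quality signals, more documents survive than when a single strict rule decides.

```python
from typing import Callable, List

Scorer = Callable[[str], float]


def length_scorer(text: str) -> float:
    """Hypothetical scorer: longer documents score higher, capped at 1.0."""
    return min(len(text.split()) / 20.0, 1.0)


def vocabulary_scorer(text: str) -> float:
    """Hypothetical scorer: share of distinct words in the document."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0


def ensemble_score(text: str, scorers: List[Scorer]) -> float:
    """Combine scorers by taking the maximum (an assumed combination rule)."""
    return max(scorer(text) for scorer in scorers)


def filter_documents(docs: List[str], scorers: List[Scorer], threshold: float) -> List[str]:
    """Keep documents whose combined score reaches the threshold."""
    return [doc for doc in docs if ensemble_score(doc, scorers) >= threshold]


if __name__ == "__main__":
    docs = [
        "buy buy buy buy buy buy",
        "Tokens are the units a language model reads during training",
        " ".join(["the model"] * 10),
    ]
    strict = filter_documents(docs, [vocabulary_scorer], threshold=0.8)
    relaxed = filter_documents(docs, [length_scorer, vocabulary_scorer], threshold=0.8)
    print(len(strict), "kept by one scorer;", len(relaxed), "kept by the ensemble")
```

In this toy setup the ensemble keeps a document that the single scorer would discard, which mirrors in miniature how relaxing a strict filter can increase the number of tokens retained. Whether the extra documents are actually high quality depends entirely on the classifiers used.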
NVIDIA has made Nemotron-CC available on the Common Crawl platform and announced that it will soon publish documentation for the dataset on the company’s GitHub page. This will make the dataset easy to use for both academic and commercial users. You can access the new dataset here.
So, what do you think will be the impact of this innovation on the future of artificial intelligence technologies? You can share your opinions in the comments section below…
Source link: https://shiftdelete.net/nvidiadan-6-3-trilyon-tokenli-veritabani-nemotron-cc