NLP Dataset Trends
NLP has seen an explosion of research and innovation throughout 2019. As 2020 begins, three trends affecting NLP datasets are the use of unsupervised or unlabeled data, an emphasis on sample efficiency alongside ever-larger datasets, and the creation of more datasets that cover multiple languages.
Use of Unsupervised / Unlabeled Data
- Supervised data is expensive and requires time and attention for labeling. To reduce these costs, more unsupervised or unlabeled data is being used to supplement supervised training.
- Training on this type of NLP dataset enables efficient translation between a language that has a pre-trained model and a language for which no model has previously been trained. The unlabeled data, though noisy, supports unsupervised word translation between the two languages.
- This methodology also allows a model to learn the underlying nuances of a language.
- Zero-shot models, which handle tasks without any task-specific training at all, are also being utilized more and more.
- This trend has emerged because labeling data is expensive and time-consuming while datasets keep growing in size. Being able to process unlabeled data in a meaningful way with pre-trained models saves both money and time.
- Lexalytics comments that it would be impossible to train a model with a sentiment score for every word in every possible context.
- The company is using its Lexalytics Concept Matrix to apply unsupervised learning techniques on the top articles found on Wikipedia. It also has launched an "unsupervised matrix factorization applied to a massive corpus of content (many billions of sentences)" called Syntax Matrix that allows for the parsing of a sentence.
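The core idea behind unsupervised word translation can be shown with a toy sketch: if word embeddings from two languages live in a shared, aligned vector space, translating a word reduces to a nearest-neighbor lookup by cosine similarity. The two-dimensional vectors and word pairs below are invented for illustration; real systems first learn the alignment itself from monolingual data alone.

```python
import math

# Toy word vectors assumed to already live in a shared (aligned) space.
# In practice the alignment is learned without parallel data; here the
# vectors are hand-picked purely to illustrate the lookup step.
en_vecs = {"cat": [0.9, 0.1], "dog": [0.8, 0.3], "house": [0.1, 0.9]}
fr_vecs = {"chat": [0.88, 0.12], "chien": [0.79, 0.28], "maison": [0.12, 0.91]}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def translate(word, src, tgt):
    """Return the target-language word whose vector is nearest by cosine."""
    vec = src[word]
    return max(tgt, key=lambda w: cosine(vec, tgt[w]))

print(translate("cat", en_vecs, fr_vecs))  # -> chat
```

No labeled translation pairs appear anywhere in the lookup itself, which is what makes the approach attractive when parallel corpora are scarce or noisy.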
Larger Datasets and More Sample Efficiency Tests
- NLP expert Sudalai Rajkumar states that it should be expected over 2020 that "much bigger deep learning models trained on much larger datasets to get state-of-the-art results" will be the norm.
- Sebastian Ruder, another NLP expert, stated: "Rather than just learning from huge datasets, we’ll see more models being evaluated in terms of their sample efficiency, how well they can learn with tens or hundreds of examples. We’ll see an increasing emphasis of sparsity and efficiency in models."
- As with the first trend, the larger the datasets and models become, the more processing power and compute time they require. This trend is thus developing in order to utilize large datasets more efficiently.
- Skip-thought vectors may in time allow for unsupervised prediction of the next paragraph of a book, or of which chapter should come next. At present, however, these processes are too sample-inefficient to be practical.
- Rare Technologies has created an API to allow for easier use of, and access to, the ever-growing number of NLP datasets and modules.
- Stanford University has compiled a database of large datasets for the research and analysis of social networks.
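Sample efficiency in the sense Ruder describes can be measured with a simple learning curve: evaluate the same model after training on tens, hundreds, and thousands of examples and compare accuracy. The nearest-centroid "model" and synthetic one-dimensional data below are illustrative assumptions standing in for a real classifier and dataset.

```python
import random

random.seed(0)

# Synthetic binary task: class 0 clusters near -1.0, class 1 near +1.0.
def sample(n):
    data = []
    for i in range(n):
        label = i % 2  # alternate labels so both classes always appear
        x = random.gauss(-1.0 if label == 0 else 1.0, 0.5)
        data.append((x, label))
    return data

def fit_centroids(train):
    # Nearest-centroid "model": the mean feature value of each class.
    return {
        label: sum(x for x, y in train if y == label)
        / sum(1 for _, y in train if y == label)
        for label in (0, 1)
    }

def accuracy(centroids, test_data):
    hits = sum(
        1 for x, y in test_data
        if min(centroids, key=lambda c: abs(x - centroids[c])) == y
    )
    return hits / len(test_data)

# Learning curve: the same evaluation repeated at several training sizes.
test_data = sample(500)
for n in (10, 100, 1000):
    acc = accuracy(fit_centroids(sample(n)), test_data)
    print(f"n={n:5d}  accuracy={acc:.2f}")
```

The interesting question for a sample-efficient model is how quickly the curve flattens: a model that reaches near-peak accuracy at tens of examples is preferable to one that needs thousands, even if both converge to the same score.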
More Datasets for Multiple Languages
- Sebastian Ruder also commented that "we’ll see a focus on multilinguality and more datasets that cover multiple languages" in the coming year.
- This trend builds on the previous two: it depends both on having larger datasets to work with and on applying new machine learning techniques and training models.
- The University of Cambridge, along with analytics company Lionbridge, has been working on cross-lingual NLP language modeling, with the ultimate goal of automatic translation.
- In Germany, the Karlsruhe Institute of Technology has been utilizing Wikipedia to make cross-lingual NLP with DBPedia a practical resource for researchers rather than a hindrance.