Dataset Analysis

NLP Dataset Trends

NLP saw an explosion of research and innovation throughout 2019. As 2020 begins, three trends affecting NLP datasets are the use of unsupervised or unlabeled data, a push toward greater sample efficiency on ever-larger datasets, and the creation of datasets that cover multiple languages.

Use of Unsupervised / Unlabeled Data

  • Supervised data is expensive and requires time and attention for labeling. To reduce these costs, unsupervised or unlabeled data is increasingly being used to supplement supervised training.
  • This type of NLP dataset training enables efficient learning for translation between a language with a pre-trained model and a language for which no model has previously been trained. Although noisy, the unsupervised data enables unsupervised word translation between the two languages.
  • This methodology also allows for the learning of the underlying nuances of a language.
  • Zero-shot models, which perform tasks without any task-specific training at all, are also being utilized more and more.
  • This trend has formed because it is expensive and time-consuming to label data and datasets are growing in size. Being able to process unlabeled and unsupervised data in a meaningful way utilizing pre-trained models saves on both financial costs and time.
  • Lexalytics comments that it would be impossible to train a model with a sentiment score for every word in every possible context.
  • The company is using its Lexalytics Concept Matrix to apply unsupervised learning techniques on the top articles found on Wikipedia. It also has launched an "unsupervised matrix factorization applied to a massive corpus of content (many billions of sentences)" called Syntax Matrix that allows for the parsing of a sentence.
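The unsupervised word-translation idea described above can be sketched in a few lines: once word embeddings for two languages have been mapped into a shared space, a translation candidate is simply the nearest neighbor by cosine similarity. The vocabularies and vector values below are invented for illustration, not taken from any real model.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Toy embeddings assumed to be already aligned into one shared space (invented values).
english = {"cat": [0.9, 0.1, 0.0], "dog": [0.1, 0.9, 0.0]}
french = {"chat": [0.88, 0.12, 0.02], "chien": [0.15, 0.85, 0.05]}

def translate(word, src, tgt):
    """Return the target-language word whose vector is closest to the source word's."""
    return max(tgt, key=lambda w: cosine(src[word], tgt[w]))

print(translate("cat", english, french))  # → chat
```

Real systems learn the cross-lingual alignment itself without parallel data, which is the hard (and noisy) part; the nearest-neighbor lookup shown here is only the final step.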

Larger Datasets and More Sample Efficiency Tests

  • NLP expert Sudalai Rajkumar expects that over 2020, "much bigger deep learning models trained on much larger datasets to get state-of-the-art results" will become the norm.
  • Sebastian Ruder, another NLP expert, stated: "Rather than just learning from huge datasets, we’ll see more models being evaluated in terms of their sample efficiency, how well they can learn with tens or hundreds of examples. We’ll see an increasing emphasis of sparsity and efficiency in models."
  • Similar to the first trend of using unsupervised data, the larger the datasets and models become, the more processing power and compute time are required. Thus, this trend is developing in order to more efficiently utilize the large datasets being processed.
  • The utilization of skip-thought vectors may in time allow for unsupervised prediction of the next paragraph or chapter of a book, or for deciding which chapter should come next. At present, however, these processes are too sample-inefficient to be practical.
  • Rare Technologies has created an API to allow easier use of, and access to, the ever-growing number and size of NLP datasets and modules.
  • Stanford University has compiled a database of large datasets for the research and analysis of social networks.
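The sample-efficiency evaluation Ruder describes can be pictured as a learning curve: train on progressively larger subsets of the data and record accuracy at each size. The toy task and threshold learner below are invented purely to illustrate the idea.

```python
# Invented 1-D toy task: feature x in [0, 1], label "hi" when x exceeds 0.5.
DATA = []
for i in range(50):
    DATA.append((i / 100, "lo"))          # lo examples: 0.00 .. 0.49
    DATA.append(((i + 51) / 100, "hi"))   # hi examples: 0.51 .. 1.00

def train_threshold(samples):
    """Fit a decision threshold at the midpoint of the two class means."""
    hi = [x for x, y in samples if y == "hi"]
    lo = [x for x, y in samples if y == "lo"]
    return (sum(hi) / len(hi) + sum(lo) / len(lo)) / 2

def accuracy(threshold, samples):
    return sum((x > threshold) == (y == "hi") for x, y in samples) / len(samples)

# Learning curve: accuracy as a function of how many examples the model saw.
for n in (10, 50, 100):
    t = train_threshold(DATA[:n])
    print(n, accuracy(t, DATA))  # 0.78 with 10 examples, 0.88 with 50, 1.0 with 100
```

A sample-efficient model is one whose curve rises quickly, reaching near-final accuracy with only tens of examples.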

More Datasets for Multiple Languages

NLP Trends

Natural language processing (NLP) is a burgeoning field with virtually limitless potential. Three areas of particular focus for researchers within this field are natural language understanding (NLU) and its associated conversational AI, machine translation, and semantic search. These trends, as well as applications of each, are discussed below, followed by an explanation of the research strategy employed.

Natural Language Understanding (NLU) and Conversational AI

  • NLU is a subfield of NLP "that is based on machine reading comprehension." It allows machines to deconstruct and analyze "the elemental pieces of human speech or a given text" to interpret language and deduce intent.
  • NLU is distinguished from general NLP in that, when employing NLU, "computers can deduce what a speaker actually means, and not just the words they say."
  • NLU "provides an understanding of how terms are used in context for situations involving sarcasm, irony, sentiment, humor, colloquialisms, and others," allowing machines to glean a more holistic understanding of human language than would otherwise be possible simply by interpreting the meaning of words and syntax.
  • Similarly, a conversational AI is one that can carry out conversations with humans, and was a major area of research demonstrated at the 2019 meeting of the Association for Computational Linguistics (ACL).
  • Researchers working in this field hope to develop NLU and conversational AI "to the point where conversing with a machine is as simple as conversing with a human."
  • A notable application of NLU is in Amazon's Alexa. Any device enabled with Alexa technology, such as the Amazon Echo, can host NLU applications (called "skills" in Amazon lingo) to more comprehensively understand the questions and directions of its users.
  • Amazon further elucidates the application of NLU as follows: "Before NLU, designing a weather app with voice as an input would require a list of a thousand ways we could ask 'is it raining.' With NLU, Alexa-enabled devices like Amazon Echo can apply learnings from historical interactions, across thousands of diverse applications, to understand that 'is it raining outside' and 'is it going to rain' are essentially the same question."
  • IBM has also applied NLU to its Watson AI system, offering it as a cloud-based product that performs "advanced text analytics."
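The Amazon example above, in which "is it raining outside" and "is it going to rain" are recognized as the same question, can be caricatured with a naive intent classifier that scores word overlap between an utterance and example phrases for each intent. The intents and example phrases here are invented, and Jaccard overlap is only a crude stand-in for the learned representations a real NLU system uses.

```python
def tokens(text):
    """Lowercase, strip question marks, split into a set of words."""
    return set(text.lower().replace("?", "").split())

# Example utterances per intent (invented for illustration).
INTENTS = {
    "weather_query": ["is it raining", "will it rain today", "what is the weather"],
    "set_alarm": ["set an alarm", "wake me up at seven"],
}

def classify(utterance):
    """Pick the intent whose examples share the most words with the utterance (Jaccard)."""
    def score(intent):
        return max(
            len(tokens(utterance) & tokens(ex)) / len(tokens(utterance) | tokens(ex))
            for ex in INTENTS[intent]
        )
    return max(INTENTS, key=score)

print(classify("is it raining outside"))  # weather_query
print(classify("is it going to rain"))    # weather_query
```

Even this crude baseline maps both phrasings to the same intent; NLU's contribution is doing so robustly across sarcasm, colloquialisms, and wordings that share no surface vocabulary at all.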

Machine Translation

  • As NLP and associated fields develop, one of the key focus areas is multilingual translation. Its importance in a variety of areas is the reason so many NLP researchers are focused on it, and some predict that machines will largely replace human-based translation services "within the next 5-10 years."
  • While some machine translation services exist, the neural models employed in them often miss words, and "still have difficulties with handling abbreviations and named entities."
  • Potential NLP solutions to these problems "involve more context modeling and data de-biasing, as well as leveraging multi-task learning and human knowledge to strengthen the models further."
  • Some researchers, such as the Stanford NLP Group, are exploring "techniques that utilize both statistical methods and deep linguistic analyses" to improve machine translation.
  • One notable application of NLP-based machine translation is in Linguistic Systems, Incorporated (LSI)'s AI Translate product, which is utilized by professionals across a range of technical fields to "extract the precise meaning of foreign language content."
  • Another application of which the general public is likely more aware is Microsoft's translation software, most recently popularized by a holiday commercial showing a young girl talking to a reindeer using a Microsoft product. The software is reportedly going to be utilized by Amazon and Google.
  • Earlier applications of machine translation (albeit less advanced than newer iterations are expected to be, and often rife with the problems mentioned above) include translation services currently offered by Google and Facebook.

Semantic Search

  • Semantic search is an NLP process using both standard NLP and NLU to "quickly process multiple documents to define specific terms, insights, and requirements."
  • The power of semantic search is that it "can search content for its meaning in addition to keywords, and maximize the chances the user will find the information they are looking for." Thus, accurate search results may appear even if the specific keywords do not appear in the results.
  • While not new, semantic search is an increasingly notable trend in the NLP space primarily because more businesses are beginning to recognize its far-reaching applicability. As one expert explains, "all of a sudden now it’s jumping forward because you can see how semantic search makes your data accessible to more business users."
  • As semantic searching improves, "it will no longer be necessary to spend a lot of time on information searches and analytical research as this process gets reduced from hours to minutes," such that data-heavy industries including "insurance, banking, healthcare, pharmaceuticals," and others will become substantially more productive.
  • While most searches on Google and other search engines are simple keyword searches, Google and Bing have already begun employing semantic searches in specific contexts, like when searching for restaurants or other establishments.
  • A more specialized example of semantic search is the search engine developed by Azati for a company in the in vitro diagnostics (IVD) and biopharmaceutical space.
  • The company utilized the engine's semantic capabilities to analyze an inventory that "included a considerable number of blood samples. Each blood sample was described using several tags, grouped into subcategories, which were grouped into larger categories, etc." Simple keyword searches were insufficient to cull specific results from such a substantial and complex inventory, and semantic search improved the process greatly.
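A minimal sketch of the keyword-versus-semantic distinction: the hand-written synonym table below stands in for the learned meaning representations a real semantic engine would use, so a query for "physician" can match a document that only says "doctor". All documents and mappings are invented.

```python
# Toy concept table: maps surface words to a shared "meaning" identifier.
CONCEPTS = {"doctor": "physician", "physician": "physician",
            "car": "vehicle", "automobile": "vehicle"}

DOCS = {
    "doc1": "the doctor reviewed the blood sample",
    "doc2": "the automobile was repaired quickly",
}

def keyword_search(query):
    """Match only on exact word overlap."""
    q = set(query.lower().split())
    return [d for d, text in DOCS.items() if q & set(text.split())]

def semantic_search(query):
    """Match on shared concepts, so synonyms hit too."""
    def concepts(words):
        return {CONCEPTS.get(w, w) for w in words}
    q = concepts(query.lower().split())
    return [d for d, text in DOCS.items() if q & concepts(text.split())]

print(keyword_search("physician"))   # [] — the exact keyword is absent
print(semantic_search("physician"))  # ['doc1'] — matched by meaning
```

This is why semantic search can surface accurate results even when the specific keywords never appear in them.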

Research Strategy

To compile the above list of trends, we surveyed a variety of industry publications describing various trends in NLP. We synthesized the data in these reports by selecting trends identified by multiple publications. We then sought additional information regarding these trends, including various use cases for each.

NLP Datasets Academic Papers

The area of data quality (and data quality assurance) and natural language processing has been explored by many scholars. Below are six papers or studies on this topic.

"Solution of Creating Large Data Resources in Natural Language Processing"

  • An abstract of this paper is available here.
  • The paper focuses on how resources can be unified into larger ones with a common structure and format to promote the highest data quality.

"A Data Element-Function Conceptual Model for Data Quality Checks"

  • This paper is available here.
  • This paper is about data quality assurance and how a conceptual model of data quality checks facilitates categorization and indexing. The researchers applied NLP techniques to several thousand checks.

"Best Practices Framework for Improving Maintenance Data Quality to Enable Asset Performance Analytics"

  • This paper, presented at the Annual Conference of the PHM Society, is available in full here.
  • In it, the authors establish a best practices framework for improving data quality using NLP for missing or inconsistent data.

"Assessing the Quality of Natural Language Text Data"

  • A PDF of this paper is available here.
  • This paper discusses how to measure data quality for NLP. The authors conclude that, "apart from text accessibility, today only representational text quality metrics can be derived and computed automatically."

"NLP Data Cleansing Based on Linguistic Ontology Constraints"

  • An abstract of this paper is available here.
  • In it, the authors explore the problem of linked data and assessing its quality. They apply a quality assessment based on linguistic ontology constraints to conduct NLP data quality assessments.

"Data Quality Centric Application Framework for Big Data"

  • A full PDF of this paper is available here.
  • This research discusses the larger issue of data quality in "Big Data", which includes NLP. The authors examine the risks and consequences surrounding data quality and propose a central framework for its assurance.

NLP Academic Papers

The requested information on natural language processing (NLP) is presented below.

Article 1

  • A direct link to "Natural language processing (NLP) in qualitative public health research: A proof of concept study" can be found here.

Article 2

  • The researchers employed a systematic approach to identify current clinical NLP systems that create structured information from unstructured free text. Numerous literature databases were searched with a query that combined the concepts of NLP and structured data capture.
  • The review found many NLP systems with capability to process clinical free text and create a structured output. The systems address a broad spectrum of important clinical and research tasks. The findings of the article are vital in prioritizing development of novel approaches for clinical NLP.

Article 3

  • A direct link to "Natural Language Processing: State of The Art, Current Trends and Challenges" can be found here.
  • The paper distinguishes four phases of NLP by exploring its different levels. The authors also define NLP in detail, creating a picture of how it facilitates communication between humans and machines, specifically computers.
  • It gives the history of NLP, indicating that it has gained importance in different fields, including medicine, information extraction, text categorization, spam filtering, and machine translation. The article describes how NLP is applied in these fields, along with the current trends and challenges.

Article 4

  • The author presents the reader with statistics on the volumes of digitized data exchanged daily to convey the scale of textual data that can be analyzed using NLP. He indicates that such information carries insights that can be extracted to help make important decisions in any business or industry, and that NLP helps in extracting or compiling that information.
  • The author takes the reader through how he works with NLP, indicating that NLP is a huge space within artificial intelligence (AI). He notes that enterprises are currently incorporating NLP technologies into their existing platforms more every day. The blog indicates that numerous kinds of information can be compiled from textual data, and businesses can leverage it to analyze consumer behavior and enhance internal effectiveness.

Article 5

  • A direct link to "The effects of natural language processing on big data analysis: sentiment analysis case study" can be found here.
  • The article states that social media platforms are key sources of big data. They generate huge volumes of different types of data at high rates. The data contains information that requires effective and scalable analysis methods to compile.
  • NLP is considered one of the effective techniques for preprocessing datasets. The paper examines the NLP technique to verify its effects on the quality of big data classification. The study found that this preprocessing technique achieves an improvement in the classification accuracy of the Naïve Bayes algorithm.
  • The researchers found that the technique, combined with linguistic processing, enhances the performance of the sentiment analysis by 5%. The results yielded an accuracy of 73% on the dataset used. The findings indicate that NLP improves the accuracy of the extracted data.
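The study's pipeline, NLP preprocessing followed by a Naive Bayes classifier, can be sketched as follows. The corpus, stopword list, and test utterances are invented; this is a minimal illustration of the technique, not the paper's actual implementation.

```python
import re
from collections import Counter
from math import log

STOPWORDS = {"the", "a", "is", "was", "it", "and", "what"}

def preprocess(text):
    """Lowercase, strip punctuation, drop stopwords: the cleanup step the study credits."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

# Tiny labeled corpus (invented for illustration).
TRAIN = [("Great movie, loved it!", "pos"), ("What a great film.", "pos"),
         ("Terrible plot, hated it.", "neg"), ("Awful and terrible acting.", "neg")]

counts = {"pos": Counter(), "neg": Counter()}
for text, label in TRAIN:
    counts[label].update(preprocess(text))
vocab = set(counts["pos"]) | set(counts["neg"])

def classify(text):
    """Multinomial Naive Bayes with Laplace smoothing and uniform class priors."""
    def log_prob(label):
        c, total = counts[label], sum(counts[label].values())
        return sum(log((c[w] + 1) / (total + len(vocab))) for w in preprocess(text))
    return max(counts, key=log_prob)

print(classify("I loved this great film"))       # pos
print(classify("what an awful, terrible plot"))  # neg
```

The preprocessing step shrinks the feature space to informative words, which is the kind of gain the study measured when it reported improved classification accuracy.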

France NLP Data Vendors

Most datasets for NLP purposes are freely available, such as Mozilla's, but Lionbridge and Appen are two prominent vendors offering annotated and curated data for purchase.


  • Lionbridge is an international company providing datasets in more than 300 languages. It has over 500,000 analysts worldwide who help provide custom data that is carefully curated and useful for NLP training.
  • The company has at least two offices in France.


  • Appen has speech data in many languages, including French from France.
  • These are purchasable datasets.


  • Mozilla is building a large repository of voice data for NLP and machine-learning purposes.
  • Their French collection has over 184 hours of recordings from 3,005 voices; 64% of the speakers are Francophones from France.
  • The data is available to use under the Creative Commons license.

Research Strategy

Finding dataset vendors was not straightforward. Overall, it seems that most companies or individuals use freely available datasets and then outsource the actual annotation or processing of the data. There is a huge amount of data freely available, including in the French language; this website lists 20 different large data sources in French. Another issue was that the dataset vendors we could find either did not offer French or did not specify any information about languages. However, after extensive research, we found that Lionbridge and Appen offer French-language audio data for sale that has been annotated, verified, and/or curated specifically for machine-learning purposes. We also include Mozilla as a provider, since its French audio dataset contains a robust amount of usable data. A few other, smaller audio sources in French include VoxForge and Ortolang.

As a note, datasets are generally organized by language rather than country.

France NLP Data Vendors Competitive Landscape 1

Data regarding the natural language processing (NLP) dataset offering of Lionbridge has been compiled in rows 3-16, column B of the spreadsheet. Lionbridge only delivers customized projects based on the client's wishes, drawing on its database of 10,000 hours of recordings.


  • According to the company website, the company currently owns "over 5,000 unique conversations from a thousand of global contributors in 16 target languages."
  • According to a case study published on the company website, the work the company has done before included a database that covered 10,000 hours of recordings.
  • Customers listed on Lionbridge website include Traveloka, Expedia, NKK, Line, and CrowdWorks.

France NLP Data Vendors Competitive Landscape 2

Data regarding the natural language processing (NLP) dataset offering of Mozilla has been compiled in rows 3-16, column C of the spreadsheet. Mozilla offers an open database of more than 2,454 hours of recordings covering over 50 world languages.

Mozilla Common Voice

  • Common Voice, Mozilla's natural language processing (NLP) data set offering, operates under the Creative Commons Attribution Share-Alike 3.0 Unported license.
  • The audience for this data set spans across all industries as well as private customers due to the data set being offered free of charge.
  • Advertised customers include Mycroft, Leon, FusionPBX, and Mozilla.

France NLP Data Vendors Competitive Landscape 3

Appen provides six French (from France) datasets for sale. All information has been added to the spreadsheet, as requested.

Selected Findings

  • The French (France) audio datasets are: telephony (two datasets), in-car, conversational, conversational telephony, microphone, and a pronunciation lexicon.
  • Appen does not disclose their pricing publicly. Interested parties must contact them directly for pricing information.
  • The licensing process is not detailed, but Appen states its datasets are "licensable".
  • They service companies in retail, technology, automotive, healthcare, government and financial services.
  • Customers are not advertised alongside the datasets. The overall testimonials section also does not disclose client names.
  • The file type and encoding information was not available for any of the French (France) datasets.
  • Technical and specific data about each of the 6 datasets offered for sale (though one is just an accompanying lexicon) is available in the spreadsheet.