Data Labeling AI Tools

Based on a market report identifying key players in data labeling for AI and a list of "best" tools, we have identified Melissa Data, Paxata, Trifacta, CloudFactory, iMerit, LabelImg, Spare5, Appen, Supervise.ly, RectLabel, and Prodigy as among the top tools. Below is an overview of each.

MELISSA DATA


PAXATA


TRIFACTA

  • Trifacta publishes software tools to help "individuals and organizations more efficiently explore, transform and join together diverse data for analysis." Their primary product is an AI tool for "data wrangling," to use their phrase.
  • Crunchbase page
  • Trifacta's AI discovers what is in an existing body of data, organizes it into well-structured databases, cleans out distorting data (e.g., null values), enriches the scope of the user's analysis, validates the data quality, and publishes the results of this "data wrangling."
  • Screenshots

CLOUDFACTORY

  • In addition to its machine learning (AI) offerings, CloudFactory provides "flexible workforce solutions" to process large volumes of data, including the human "CloudWorkers" who produce the clean data needed to train an AI algorithm.
  • Crunchbase page
  • CloudFactory's team can determine spatial relationships, train AIs to recognize objects, and tag and annotate images and videos, "all to prepare datasets for smarter natural language processing algorithms across the automation ecosystem."
  • Because CloudFactory provides datasets to companies for training AIs, screenshots of its platform are not provided on the site.

IMERIT

  • Like CloudFactory, iMerit publishes clean datasets to enable the training of an AI system rather than publishing the system itself.
  • Crunchbase page
  • iMerit has a dataset of over 100 million segmented and annotated images, as well as a language dataset with over 8 million annotated data points.
  • Because iMerit provides datasets to companies for training AIs, screenshots of its platform are not provided on the site.

SPARE5

  • Spare5 provides its clients with datasets that are cleaned, tagged, and annotated via crowdsourcing through the company's app.
  • Crunchbase page
  • Spare5 can provide language assessments (e.g., tone, meaning, and contents of a recorded conversation), annotated images, isolated image elements, and keywords in datasets to allow for the training of AI algorithms.
  • Screenshots

APPEN

  • Appen has a wide-ranging catalog of products and services mostly centered around linguistics and translation, but also including eCommerce, machine translation, search relevance, social media analytics, and more.
  • Crunchbase page
  • Appen can collect, annotate, and tag "high volumes of image, text, speech, audio, and video data" for training AIs using its worldwide workforce of 1 million people.
  • Because Appen provides datasets to companies for training AIs, screenshots of its platform are not provided on the site.

SUPERVISE.LY

  • Supervise.ly provides "best-in-class data labeling tools" to allow their clients to create their own datasets for training their AI, as well as a ready-made infrastructure that easily integrates into an enterprise's existing technology stack.
  • Crunchbase page
  • Supervise.ly provides SmartTool, an AI-powered annotation tool that claims to speed up the annotation process by a factor of twenty. SmartTool includes a team workflow management tool and the ability to "train and test custom neural networks" on the client's own data.
  • Screenshots

RECTLABEL


PRODIGY

  • Prodigy offers only one service, its annotation tool, which it claims can "train a new AI model in hours" in "entity recognition, intent detection or image classification." Its selling point is the ability to quickly and easily annotate data for testing purposes.
  • Prodigy does not appear to have a Crunchbase page (note: due to the number of companies named "Prodigy," we attempted to locate the Crunchbase page by the company's URL) and may have been privately funded.
  • Prodigy "is a fully scriptable annotation tool, letting you automate as much as possible with custom rule-based logic." It learns from the user as they annotate and label data and extrapolates from the user's patterns. Features include named entity recognition, text classification, computer vision, and A/B evaluation.
  • Screenshot of live demo.

RESEARCH STRATEGY

Since attempting to evaluate the myriad AI labeling and annotation tools available would take us far outside the scope of a single Wonder request, we focused our research on finding existing lists of "best" or "top" tools in this sector. We found two such lists, but as both were from companies working on their own AI solutions, we were concerned about their possible bias and the fact that they barely overlapped.

However, in the course of our research, we found several references to a market report by Cognilytica published this year. While such reports are proprietary and usually expensive to obtain, the published abstracts and tables of contents often include lists of key players, and this proved to be the case here. One of those key players is Figure Eight, which is consistent with what might be expected from the report criteria. This left us with five candidates.

To reach our list of ten, we returned to one of the "best of" lists that we originally found: a list by Lionbridge (which offers its own AI solution) that contained several of the tools mentioned in the report criteria and which we therefore judged to be more credible than the other. Since the list is not arranged in alphabetical order and has several of the tools mentioned in our criteria towards the top, we understand the author, as an industry insider, to have ordered the tools according to her assessment of their utility. However, given that Lionbridge only entered the AI field last month, we have excluded it from our list and understand its prominent inclusion in its own list to be a matter of self-serving content marketing. We also eliminated free tools published on GitHub and similar platforms, as these would not have a Crunchbase profile.

Note that some of the companies listed, like the Figure Eight example in the project criteria, provide clean datasets that make training an AI algorithm possible. As such, screenshots of their systems are not always available.