NLP Data Tools

As part of her PhD, Ashleigh developed a cloud-based web application to enable non-technical users, such as dataset curators, to investigate language datasets for potential use in natural language processing applications.

The main purpose of this tool was to provide interpretable statistics about the demographic biases present in language datasets. This information can alert dataset curators to issues in their datasets before the data is used to train AI language models, which would otherwise inherit those biases.

The tool also provided a suite of other analyses, such as topic modeling, language modeling, sentiment analysis, and word embedding training and visualization. For large datasets, the tool used a Hadoop cluster to parallelize the computations that could be expressed in the MapReduce framework.
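
As a rough illustration of the kind of computation that fits the MapReduce pattern, the sketch below counts mentions of demographic terms across documents. It is not the tool's actual code: the term lists, the function names, and the choice of statistic are all hypothetical, and it runs in a single Python process rather than on a Hadoop cluster.

```python
from collections import Counter
from functools import reduce

# Hypothetical term lists, for illustration only; not taken from the original tool.
GROUP_TERMS = {
    "female": {"she", "her", "woman", "women"},
    "male": {"he", "him", "his", "man", "men"},
}


def map_document(text):
    """Map step: count group-term mentions in a single document."""
    counts = Counter()
    for token in text.lower().split():
        word = token.strip(".,;:!?\"'()")  # drop surrounding punctuation
        for group, terms in GROUP_TERMS.items():
            if word in terms:
                counts[group] += 1
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def count_group_mentions(documents):
    """Apply the map step to each document, then fold the partial results."""
    return reduce(reduce_counts, map(map_document, documents), Counter())
```

Because each document is processed independently in the map step and the reduce step is a simple associative merge, work of this shape can be split across many machines, which is what makes it a candidate for a framework like MapReduce.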