Show HN: High-Level Synthetic Data Generation from Verbal Descriptions

(demo.repliclust.org)

3 points by mzelling 16 hours ago | 0 comments

Hi all!

In statistics, synthetic data benchmarks are important for understanding the strengths and limitations of competing algorithms. For example, in clustering – the art of identifying groups of data points that are similar to each other – researchers typically study how algorithms perform on mock scenarios like “five oblong clusters in 2D with some overlap.”

Unfortunately, creating these scenarios takes a lot of work: you have to design entire data sets so that they match the scenario description. In clustering, this means selecting cluster centers, tuning covariance matrices, and so on. As part of my PhD at Caltech, I have developed a high-level synthetic data generator for clustering that automates this process. You only have to describe your desired scenario in English, and the algorithm takes care of creating data sets with suitable clusters. This means researchers can set up benchmarks simply by passing scenario descriptions as a list of strings.
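To give a sense of the manual work involved, here is a rough sketch of hand-building "five oblong clusters in 2D" using only the Python standard library. This is not how the generator works internally and it does not use its API; the function name make_oblong_clusters and the parameters aspect_ratio and spread are made up for illustration.

```python
import math
import random


def make_oblong_clusters(n_clusters=5, points_per_cluster=100,
                         aspect_ratio=3.0, spread=4.0, seed=0):
    """Hand-build 2D Gaussian clusters (hypothetical helper, for illustration only)."""
    rng = random.Random(seed)
    points, labels = [], []
    # Standard deviations along the long and short axes; their product is 1,
    # so aspect_ratio changes the shape of a cluster but not its area.
    long_axis = math.sqrt(aspect_ratio)
    short_axis = 1.0 / math.sqrt(aspect_ratio)
    for k in range(n_clusters):
        # Step 1: pick a cluster center. A smaller spread tends to mean more overlap.
        cx = rng.uniform(-spread, spread)
        cy = rng.uniform(-spread, spread)
        # Step 2: pick the cluster's orientation. Together with the axis lengths,
        # the rotation angle determines the covariance matrix.
        theta = rng.uniform(0.0, math.pi)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        for _ in range(points_per_cluster):
            # Step 3: draw a standard normal point, stretch it, rotate it, shift it.
            u = rng.gauss(0.0, 1.0) * long_axis
            v = rng.gauss(0.0, 1.0) * short_axis
            x = cx + cos_t * u - sin_t * v
            y = cy + sin_t * u + cos_t * v
            points.append((x, y))
            labels.append(k)
    return points, labels


points, labels = make_oblong_clusters()
```

Even in this toy version, nothing guarantees that the result has "some overlap": you would have to measure the overlap between clusters and re-tune spread and aspect_ratio by hand until the data matches the description.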

We have put up a demo here: https://demo.repliclust.org. Curious to hear your thoughts!

Mike