Data curation (also called preparation, wrangling, or cleaning) is a critical stage in data science in which raw data is structured, validated, and repaired. Data validation and repair establish trust in analytical results, while appropriate structuring streamlines analytics. A new collaboration between the University at Buffalo, New York University, and the Illinois Institute of Technology is looking to build software to streamline this process, making it easier and faster to explore and analyze raw data.
Our tool, Vizier, will combine a simple "notebook-style" interface based on Jupyter with powerful back-end tools that track changes, edits, and the effects of automation. These forms of "provenance" capture both the exploratory curation process (how cleaning workflows evolve) and how the data changes over time. By connecting these different types of provenance, Vizier will not only support the auditing of curation processes but also explain the context in which they were applied, making it faster and easier to curate data.
Vizier enables worry-free exploration. A simple notebook interface mirrors a spreadsheet view of your data, tracking the provenance of your edits. Provenance is at the heart of Vizier: it makes it easy to undo and redo actions, and it allows Vizier to suggest new curation steps or visualizations and to make informed guesses about your data. Finally, provenance allows you to develop curation workflows on small datasets and then seamlessly deploy them to larger datasets (e.g., via Spark or Hadoop).
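To make the idea concrete, here is a minimal sketch of how a provenance log can power undo and workflow replay. This is an illustrative toy, not Vizier's actual API: the `ProvenanceLog` class and its methods are hypothetical, and a real system would record far richer metadata per step.

```python
# Hypothetical sketch: a provenance log that records each curation step,
# enabling undo and replay of the same workflow on a different dataset.
import copy

class ProvenanceLog:
    def __init__(self, data):
        self.history = [copy.deepcopy(data)]  # snapshot before each edit
        self.steps = []                       # recorded (name, operation) pairs

    def apply(self, name, fn):
        """Record an edit and apply it to the latest version of the data."""
        new = fn(copy.deepcopy(self.history[-1]))
        self.history.append(new)
        self.steps.append((name, fn))
        return new

    def undo(self):
        """Revert the most recent edit by restoring the prior snapshot."""
        if self.steps:
            self.history.pop()
            self.steps.pop()
        return self.history[-1]

    def replay(self, data):
        """Re-run the recorded workflow on a (possibly larger) dataset."""
        for _, fn in self.steps:
            data = fn(data)
        return data

# Develop a cleaning workflow interactively on a small sample...
log = ProvenanceLog([" Alice ", "BOB", None])
log.apply("strip whitespace", lambda rows: [r.strip() if r else r for r in rows])
log.apply("drop missing", lambda rows: [r for r in rows if r is not None])

# ...then deploy the same recorded steps to new data.
print(log.replay([" Carol ", None, "dave "]))  # ['Carol', 'dave']
```

In a production setting the recorded steps would be serialized as a workflow and shipped to a distributed engine such as Spark, which is what makes "develop small, deploy big" possible.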