Databricks introduces new API for generating synthetic datasets

Databricks Inc. today introduced an application programming interface that customers can use to generate synthetic data for their machine learning projects.

The API is available in Mosaic AI Agent Evaluation, a tool that the company offers as part of its flagship data lakehouse. The tool helps developers compare the output quality, cost and latency of artificial intelligence applications. Mosaic AI Agent Evaluation rolled out in June alongside Mosaic AI Agent Framework, which eases the task of implementing retrieval-augmented generation.

Synthetic data is information generated with the help of an AI for the sole purpose of neural network development. Creating training datasets in this manner is considerably faster and more cost-efficient than assembling them manually. Databricks’ new API is geared toward generating question-and-answer collections, which are useful for developing applications powered by large language models.

Creating a dataset with the API is a three-step process.

Developers must first upload a DataFrame, or structured data table, with business information relevant to the task their AI application will perform. DataFrames must be in a format supported by Apache Spark or Pandas. Spark is the open-source data processing engine that underpins Databricks’ platform, while Pandas is a popular analytics tool for the Python programming language.

After uploading the sample data, developers must specify the number of questions and answers the API should generate. They can optionally provide additional instructions to customize the API’s output. A software team may specify, for example, the style in which the questions should be generated, the task for which they will be used, and the end users who will interact with the AI application.
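In rough Python terms, the three steps might look like the sketch below. The generator function here is a self-contained stand-in, not Databricks’ actual API, and field names such as `doc_uri`, `content` and `expected_facts` are illustrative assumptions rather than the service’s real schema.

```python
# Hypothetical sketch of the three-step workflow. generate_evals() is a
# stand-in for Databricks' service, which sends each document to an LLM;
# here we just fabricate placeholder rows to show the shapes involved.

def generate_evals(docs, num_evals, question_guidelines=""):
    """Stand-in generator: returns one question-and-answer record per
    requested eval, cycling through the supplied business documents."""
    evals = []
    for i in range(num_evals):
        doc = docs[i % len(docs)]  # step 1: the uploaded business documents
        evals.append({
            "question": f"Placeholder question {i} about {doc['doc_uri']}",
            # Per Databricks, the "answer" is a list of required facts,
            # not a prose response written by the LLM.
            "expected_facts": [f"fact drawn from {doc['doc_uri']}"],
            "source_doc": doc["doc_uri"],
        })
    return evals

# Step 1: a small table of business documents (a Spark or Pandas
# DataFrame in practice; plain dicts here to stay self-contained).
docs = [
    {"doc_uri": "handbook/returns.md", "content": "Items may be returned within 30 days."},
    {"doc_uri": "handbook/shipping.md", "content": "Standard shipping takes 3-5 business days."},
]

# Steps 2 and 3: how many pairs to generate, plus optional guidelines
# covering style, task and end users.
evals = generate_evals(
    docs,
    num_evals=4,
    question_guidelines="Questions come from retail support agents; keep them short.",
)
print(len(evals))  # 4 question-and-answer records
```

The key takeaway is the shape of the exchange: documents in, a fixed number of question-and-answer records out, with optional guidelines steering the question style.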

Inaccurate training data can reduce the quality of an AI model’s output. As a result, companies often have subject matter experts review a synthetic dataset for errors before feeding it to a neural network. Databricks says it developed the API in a manner that eases this part of the workflow. 

“Importantly, the generated synthetic answer is a set of facts that are required to answer the question rather than a response written by the LLM,” Databricks engineers detailed in a blog post today. “This approach has the distinct benefit of making it faster for an SME to review and edit these facts vs. a full, generated response.”
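The fact-list format the engineers describe can be pictured with a plain record: reviewing means checking or correcting individual facts rather than rewriting a whole paragraph. The field names below are illustrative assumptions, not Databricks’ actual schema.

```python
# Illustrative record: the synthetic "answer" is a list of required facts,
# not LLM-written prose. Field names are assumptions for illustration.
eval_row = {
    "question": "How long does standard shipping take?",
    "expected_facts": [
        "Standard shipping takes 3-5 business days.",
        "Expedited shipping is available for an extra fee.",
    ],
}

def correct_fact(row, index, corrected_text):
    """An SME edit touches one reviewed fact in place, leaving the
    rest of the record untouched."""
    row["expected_facts"][index] = corrected_text
    return row

# A subject matter expert corrects the second fact only.
correct_fact(eval_row, 1, "Expedited shipping takes 1-2 business days.")
print(eval_row["expected_facts"][1])  # prints the corrected fact
```

Editing a single list entry is cheaper to review than proofreading a full generated paragraph, which is the benefit the blog post highlights.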

Databricks plans to release several enhancements for the API early next year. A new graphical interface will enable dataset reviewers to more quickly check question-answer pairs for errors and add more pairs if necessary. Additionally, Databricks will add a tool for tracking how a company’s synthetic datasets change over time. 


