Tonic.ai, the San Francisco-based company pioneering data synthesis solutions for software and AI developers, today announced the launch of the world’s first secure data lakehouse for LLMs, Tonic Textual, to enable AI developers to seamlessly and securely leverage unstructured data for retrieval-augmented generation (RAG) systems and large language model (LLM) fine-tuning. Tonic Textual is an all-in-one data platform designed to eliminate integration and privacy challenges ahead of RAG ingestion or LLM training, two of the biggest bottlenecks hindering enterprise AI adoption. Leveraging its expertise in data management and realistic synthesis, Tonic.ai has developed a solution that tames and protects siloed, messy, and complex unstructured data and converts it into AI-ready formats ahead of embedding, fine-tuning, or vector database ingestion.
The Untapped Value of Unstructured Data
Enterprises are rapidly expanding investments in generative AI initiatives across their businesses, motivated by the technology’s transformational potential. Optimal deployments must leverage enterprises’ proprietary data, which is often stored in messy unstructured formats across various file types and contains sensitive information about customers, employees, and business secrets. IDC estimates that approximately 90% of data generated by enterprises is unstructured, and, in 2023 alone, organizations were expected to generate upwards of 73,000 exabytes of unstructured data. Before unstructured data can be used for AI initiatives, it must be extracted from siloed locations and standardized, a time-consuming process that monopolizes developer time. According to a 2023 IDC survey, 50% of companies have mostly or completely siloed unstructured data, and 40% of companies are still manually extracting information from the data.
“We’ve heard time and again from our enterprise customers that building scalable, secure unstructured data pipelines is a major blocker to releasing generative AI applications into production,” said Adam Kamor, Co-Founder and Head of Engineering, Tonic.ai. “Textual is specifically architected to meet the complexity, scale, and privacy demands of enterprise unstructured data and allows developers to spend more time on data science and less on data preparation, securely.”
The Importance of Privacy in AI
Data privacy is paramount among enterprise decision makers, particularly when they use third-party model services. The same IDC survey reported that 46% of companies cite data privacy compliance as a top challenge in leveraging proprietary unstructured data in AI systems. Organizations must protect sensitive information in the data from model memorization and accidental exfiltration, or risk costly compliance violations.
“AI data privacy is a challenge the Tonic.ai team is uniquely positioned to solve due to their deep experience building privacy-preserving synthetic data solutions,” said George Mathew, Managing Director at Insight Partners. “As enterprises make inroads implementing AI systems as the backbone of their operations, Tonic.ai has built an innovative product in Textual to supply secured data that protects customer information and enables organizations to leverage AI responsibly.”
Introducing the Secure Data Lakehouse for LLMs
Tonic Textual is a first-of-its-kind data lakehouse for generative AI that can be used to seamlessly extract, govern, enrich, and deploy unstructured data for AI development. With Tonic Textual, you can:
- Build, schedule, and automate unstructured data pipelines that extract and transform data into a standardized format convenient for embedding, ingesting into a vector database, or pre-training and fine-tuning LLMs. Textual supports the leading formats for unstructured free-text data out of the box, including TXT, PDF, CSV, TIFF, JPG, PNG, JSON, DOCX, and XLSX.
- Automatically detect, classify, and redact sensitive information in unstructured data, and optionally re-seed redactions with synthetic data to maintain the semantic meaning of your data. Textual leverages proprietary named entity recognition (NER) models trained on a diverse data set spanning domains, formats, and contexts to ensure that sensitive data is identified and protected in any form it may take.
- Enrich your vector database with document metadata and contextual entity tags to improve retrieval speed and context relevance in RAG systems.
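The detect, redact, and re-seed steps described above can be illustrated with a minimal Python sketch. It does not use the Tonic Textual SDK or reflect its implementation: the regular expressions stand in for the proprietary NER models, and every name in it (`PATTERNS`, `detect`, `synthesize`, `redact`) is hypothetical and chosen only for illustration.

```python
import hashlib
import re

# Assumption: two simple regexes stand in for a trained NER model.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def detect(text):
    """Return non-overlapping (start, end, label) spans, sorted by start."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    spans.sort()
    kept, last_end = [], 0
    for start, end, label in spans:
        if start >= last_end:  # drop any span that overlaps an earlier one
            kept.append((start, end, label))
            last_end = end
    return kept


def synthesize(value, label):
    """Build a deterministic fake value so the same input maps to the same output."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    if label == "EMAIL":
        return f"user_{digest[:8]}@example.com"
    if label == "PHONE":
        number = int(digest[:8], 16) % 10000
        return f"555-555-{number:04d}"
    return f"[{label}]"


def redact(text, use_synthetic=False):
    """Replace detected entities and return (new_text, sorted entity tags)."""
    pieces, tags, cursor = [], set(), 0
    for start, end, label in detect(text):
        pieces.append(text[cursor:start])
        original = text[start:end]
        pieces.append(synthesize(original, label) if use_synthetic else f"[{label}]")
        tags.add(label)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), sorted(tags)


if __name__ == "__main__":
    sample = "Contact jane.doe@acme.io or 415-555-1234."
    print(redact(sample))                      # placeholder redaction
    print(redact(sample, use_synthetic=True))  # synthetic replacement
```

In this sketch, the returned tag list plays the role of the contextual entity tags mentioned above: it could be stored as metadata alongside each chunk in a vector database. Deriving the synthetic values from a hash keeps replacements consistent across documents, which is one way to preserve references to the same entity without exposing the original value.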
Looking ahead, our roadmap includes plans to add capabilities that further simplify building generative AI systems on proprietary data without compromising privacy for utility, including:
- Native SDK integrations with popular embedding models, vector databases, and AI developer platforms to create fully automated, end-to-end data pipelines that fuel AI systems with high-quality, secure data.
- Additional capabilities for data cataloging, data classification, data quality management, data privacy and compliance reporting, and identity and access management to ensure organizations can utilize generative AI responsibly.
- An expanded library of data connectors, including native integrations with cloud data lakes, object stores, cloud storage and file-sharing platforms, and enterprise SaaS applications, enabling AI systems to connect to data across the entire organization.
“Companies have amassed a staggering amount of unstructured data in the cloud over the last two decades; unfortunately, its complexity and the nascency of analytical methods have prevented its use,” said Oren Yunger, Managing Partner at Notable Capital. “Generative AI has finally unlocked the use case for that data, and Tonic.ai has stepped in to solve the complexity problem in a way that reflects its core mission to transform how businesses handle and leverage sensitive data while still enabling developers to do their best work.”
About Tonic.ai
Tonic.ai empowers developers while protecting customer privacy by enabling companies to create safe, synthetic versions of their data for use in software development, testing, and MLOps. Founded in 2018, with offices in San Francisco, Atlanta, New York, and London, the company is pioneering enterprise solutions for data de-identification, subsetting, and synthesis. Thousands of developers use data generated with Tonic.ai’s products on a daily basis to build their products faster in industries as wide-ranging as healthcare, financial services, logistics, edtech, and e-commerce. Working with customers like eBay, Walgreens, Texas Capital Bank, and the NHL, Tonic.ai innovates to advance its goal of advocating for the privacy of individuals while enabling companies to do their best work. For more information, visit https://www.tonic.ai or follow @tonicfakedata on Twitter.
Notes to Editors
Source: IDC White Paper, Sponsored by Box Inc., “Untapped Value: What Every Executive Needs to Know About Unstructured Data,” Doc. US51128223, August 2023
View source version on businesswire.com: https://www.businesswire.com/news/home/20240528025080/en/