How Reality Defender Sources Data

Reality Defender

Data Engineering Team

Data is the lifeblood of Artificial Intelligence. Chatbots, image generators, and AI-fueled deepfake detection tools all rely on quality data input to maximize their effectiveness.

This is why Reality Defender maintains rigorous standards for the data we use to train and test our deepfake detection models. Sourcing high-quality datasets from ethical sources ensures that our models can detect the widest range of AI markers while upholding our commitment to using AI responsibly.

AI Training Data for Deepfake Detection

Reality Defender's state-of-the-art detection models require two types of data for training: authentic content, and deepfake content created or manipulated with generative AI. We acquire licenses for high-quality datasets and strive to ensure we do not train on content that violates creators’ rights.

Deepfake datasets are necessary for our models to learn the difference between authentic and AI-generated content. As with the authentic data, many of our deepfake datasets are licensed from research and academic sources. Because the effectiveness of our deepfake detection models depends on our ability to keep up with the evolving generative AI tools that enable users to create deepfakes, some of our most valuable datasets are those we acquire from popular AI creation platforms. To teach our models to accurately spot a deepfake, we must first study the generation process with as many diverse image, video, audio, and text samples as possible.

Copyright and Bias in AI Training Data

Copyright is a major issue in the world of AI training. Our data engineers perform rigorous audits to ensure our models are not trained on datasets that contain copyrighted material. Creators have nothing to worry about when it comes to Reality Defender’s output: because our models are set up as classifiers, they are unable to reproduce any of the data that is fed into them. Generative AI models are trained to generate content similar to their training datasets, which can result, and has resulted, in the reproduction of items from those training sets. Our models, by contrast, use datasets only to learn to produce a probability score stating how likely it is that the input is AI-generated.
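To make the classifier distinction concrete, here is a minimal sketch of a model whose only output is a probability score. It assumes a simple logistic classifier over three numeric features; the function name, weights, and feature values are hypothetical and chosen for illustration, not taken from Reality Defender's models.

```python
import math


def deepfake_probability(features, weights, bias):
    """Return a score in (0, 1): the estimated probability that the input is AI-generated.

    Hypothetical logistic classifier: a weighted sum of features passed
    through a sigmoid. The output is a single number, never content.
    """
    logit = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Illustrative values only; a real model would learn its weights from training data.
weights = [1.5, -2.0, 0.5]
bias = -0.25
sample = [0.8, 0.1, 0.4]

score = deepfake_probability(sample, weights, bias)
print(f"Probability input is AI-generated: {score:.2f}")
```

Whatever the input, a model of this shape can only return a number between 0 and 1, which is why a classifier cannot reproduce items from its training set the way a generative model can.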

Our AI and data engineering teams then run tests and bias discovery methods to determine whether the datasets we use create bias in Reality Defender’s deepfake detection models. We perform slight dataset alterations and pit datapoints against each other to identify substandard performance. From there, we adjust the models, the data (in cases where we find the dataset is not balanced), or both to balance out bias patterns.
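One common way to run this kind of check is to compare model accuracy across subgroups of the test data and then rebalance the training data where a group lags. The sketch below illustrates that general approach under stated assumptions; it is not Reality Defender's actual procedure, and the group labels, the 0.05 gap threshold, and all function names are hypothetical.

```python
import random
from collections import defaultdict


def accuracy_by_group(records):
    """Compute accuracy per group from (group, true_label, predicted_label) tuples."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, truth, pred in records:
        totals[group] += 1
        correct[group] += int(truth == pred)
    return {g: correct[g] / totals[g] for g in totals}


def flag_underperforming(per_group, max_gap=0.05):
    """Return groups whose accuracy trails the best group by more than max_gap."""
    best = max(per_group.values())
    return sorted(g for g, acc in per_group.items() if best - acc > max_gap)


def downsample_to_balance(samples, key, seed=0):
    """Randomly downsample so each group is as large as the smallest group."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for s in samples:
        buckets[key(s)].append(s)
    n = min(len(b) for b in buckets.values())
    balanced = []
    for bucket in buckets.values():
        balanced.extend(rng.sample(bucket, n))
    return balanced


# Hypothetical evaluation results: label 1 = AI-generated, 0 = authentic.
records = [
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 1), ("A", 0, 0),
    ("B", 1, 0), ("B", 0, 0), ("B", 1, 1), ("B", 0, 1),
]
per_group = accuracy_by_group(records)
flagged = flag_underperforming(per_group)

# Hypothetical unbalanced training set: six samples from group A, two from group B.
training = [("A", i) for i in range(6)] + [("B", i) for i in range(2)]
balanced = downsample_to_balance(training, key=lambda s: s[0])
```

Downsampling is only one way to adjust the data; collecting more samples for the underrepresented group or reweighting during training are alternatives, and the right choice depends on how much data is available.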

We will continue to reassess how we source training data to ensure our tools provide the most effective deepfake protection on the market while staying true to our ethical commitments. Collaboration between generative AI platforms and companies providing deepfake detection solutions is crucial to curbing the weaponization of deepfakes and keeping harmful uses of AI to a minimum. This is why we forge data partnerships with powerful generative AI platforms to bolster our models with the most relevant, diverse datasets possible. Responsible sourcing, proper rights of ownership, and the elimination of biases in model learning are among the top priorities for Reality Defender’s data team.

