What's the Best Method for Generating Test Data?
Your data holds hidden advantages for your
business. You can unlock them through analysis, and they can lead to cost
savings, increased sales, a better understanding of your customers and their
needs, and myriad other benefits.
Unfortunately, bad test data can lead
companies astray. For example, IBM estimates that a defect caught during the
scoping phase is 15 times less costly to fix than one
that makes it to production. Getting your test data right is essential to
keeping costs low and avoiding unforced errors. Here’s what you need to
know about generating test data to ensure your business is on the right path.
What Makes a Test Data Generation Method Good?
While all data-driven business decisions require
good analysis to be effective, even good analysis of bad data produces bad results.
The best test data generation method, then, is the one that consistently and
efficiently produces good data on which you can run your analysis within the
context of your business. To make sure that analysis is based on good data,
companies should weigh each method against four criteria: safety, compliance,
speed, and accuracy and representation.
Safety
Companies often hold more personal data than many
customers realize, and keeping that data safe is an important moral duty.
However, test data generation methods are rarely neutral when it comes to
safety: each one either exposes personal data to additional risk or actively
protects it.
Compliance
Each year, governments pass new data protection
laws. If the moral duty to keep data secure weren’t enough of an incentive,
there are fines, lawsuits, and, in some countries, prison time awaiting
companies that don’t protect user data and comply with all relevant
legislation.
Speed
If you or your analysts are waiting on test
data to generate, you’re losing time that could be spent on the analysis
itself. Slow data generation can also breed a general unwillingness to work
with the most recent or most representative historical data, which lowers the
potential and quality of your analysis.
Accuracy and Representation
While one might expect that all test data
generation methods would result in accurate and representative data, that’s not
the case. Methods vary in accuracy, and some can ultimately produce data that
bears little resemblance to the truth. In those situations, your analysis can
be done faithfully, but the underlying errors in your data can lead you astray.
Test Data Generation Methods
By comparing different test data generation methods through
the lens of these four categories, we can get a feel for the scenarios in which
each technique would succeed or struggle and determine which approaches would
be best for most companies.
Database Cloning
The oldest method of generating test data on our
list is database cloning. The name pretty
much gives away how it works: You take a database and copy it. Once you’ve made
a copy, you run your analysis on the copy, knowing that any changes you make
won’t affect the original data.
Unfortunately, this method has several
shortcomings. For one, it does nothing to secure personal data in the original
database. Running analysis can create risks for your users and sometimes get
your company into legal trouble.
It also tends to suffer from speed issues. The
bigger your database, the longer it takes to create a copy. Plus, edge cases
may be under- or over-represented or even absent from your data, obscuring your
results. While this was once the way companies generated test data, given its
shortcomings, it’s a good thing that there are better alternatives.
Database Virtualization
Database virtualization isn’t a
technique solely for creating test data, but it makes the process far easier
than using database cloning alone. Virtualized databases are unshackled from
their physical hardware, making working with the underlying data extremely
fast. Unfortunately, aside from that speed, it has all the same
shortcomings as database cloning: It does nothing on its own to secure user
data, and your tests can only run on the data you already have, whether it’s
representative or not.
Data Subsetting
Data subsetting fixes some of the issues found in
the previous approaches by taking a limited sample or “subset” of the original
database. Because you’re working with a smaller sample, it will tend to be
faster, and sometimes using a selection instead of the full dataset can help
reduce errors related to edge cases. Still, this method trades
representativeness for speed, and it still does nothing to
ensure that personal data is protected, which is just asking for trouble.
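As a rough illustration, here is a minimal subsetting sketch in Python with pandas; the customers.csv and orders.csv files and their columns are hypothetical. Note how the sample must be filtered across related tables so that foreign keys in the subset still resolve.

```python
import pandas as pd

# A minimal subsetting sketch over two hypothetical exports:
# customers.csv (one row per customer) and orders.csv (keyed by customer_id).
customers = pd.read_csv("customers.csv")
orders = pd.read_csv("orders.csv")

# Sample 10% of customers; a fixed seed keeps the subset reproducible.
customer_subset = customers.sample(frac=0.10, random_state=42)

# Keep only the orders that belong to sampled customers, so foreign keys
# in the subset still resolve (referential integrity).
order_subset = orders[orders["customer_id"].isin(customer_subset["customer_id"])]

customer_subset.to_csv("customers_subset.csv", index=False)
order_subset.to_csv("orders_subset.csv", index=False)
```

A purely random sample like this is exactly where the representativeness trade-off shows up: rare edge cases may not survive a 10% draw.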
Anonymization
Anonymization fixes the issue with privacy that
pervades the previous approaches. And while it’s not a solution for test data
generation on its own, it pairs nicely with other approaches. When data is
anonymized, individual data points are replaced so they can no longer be
used to identify the person they came from. That makes the data
safer to use, especially if you’re sending it outside the company or the
country for analysis.
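Here is a minimal sketch of the idea in Python. The users.csv file, its columns, and the pseudonymize helper are all hypothetical, and strictly speaking the salted hash below is pseudonymization; a production system would use a vetted anonymization library and a properly managed secret salt.

```python
import hashlib

import pandas as pd

# A minimal anonymization sketch over a hypothetical users.csv export.
SALT = "replace-with-a-managed-secret"  # placeholder, not a real practice

def pseudonymize(value: str) -> str:
    """Replace an identifying value with a stable, one-way token."""
    return hashlib.sha256((SALT + value).encode()).hexdigest()[:12]

users = pd.read_csv("users.csv")

# Direct identifiers are replaced outright.
users["email"] = users["email"].map(pseudonymize)
users["name"] = "REDACTED"

# Quasi-identifiers are generalized: exact ages become coarse buckets,
# which weakens reidentification while keeping the column usable.
users["age"] = pd.cut(users["age"], bins=[0, 18, 35, 50, 65, 120],
                      right=False,
                      labels=["<18", "18-34", "35-49", "50-64", "65+"])

users.to_csv("users_anonymized.csv", index=False)
```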
Unfortunately, anonymization has a fatal flaw: The
more anonymized the dataset is, the weaker the connection between data points.
Too much anonymization will create a dataset that is useless for analysis. Of
course, you could opt for less anonymization within a dataset, but then you
risk reidentification if the data ever gets out. What’s a company to do?
Synthetic Data
Synthetic data is a surprisingly good solution to
most issues with other test data approaches. Like anonymization, it replaces
data to secure the underlying personally identifiable information. However, instead
of doing it point by point, it works holistically, preserving the statistical
relationships between data points while replacing the data itself in a
way that can’t be reversed.
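As a toy example of the principle, the Python sketch below fits the mean and covariance of a few numeric columns and then samples brand-new rows from that fitted distribution. The transactions.csv file and its column names are hypothetical, and real synthetic data tools use far richer models than a single Gaussian.

```python
import numpy as np
import pandas as pd

# A minimal synthetic-data sketch over a hypothetical transactions.csv.
real = pd.read_csv("transactions.csv")
numeric = real[["amount", "items", "days_since_signup"]]  # assumed columns

# Estimate the mean vector and covariance matrix. The covariance is what
# preserves the relationships *between* columns, not just each column alone.
mean = numeric.mean().to_numpy()
cov = numeric.cov().to_numpy()

# Draw brand-new rows: none corresponds to a real customer, yet the
# correlations among amount, items, and tenure are retained.
rng = np.random.default_rng(seed=7)
synthetic = pd.DataFrame(rng.multivariate_normal(mean, cov, size=1_000),
                         columns=numeric.columns)

synthetic.to_csv("transactions_synthetic.csv", index=False)

# A single Gaussian is only a toy model; it can emit impossible values
# (e.g., negative item counts), which production tools constrain away.
```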
That approach brings a lot of advantages. User
privacy is protected. Synthetic datasets can be far smaller than the originals
they were generated from while still representing the whole, which yields speed
advantages. And it works well even when there isn’t much data to start with,
helping companies run analysis at earlier stages in the process.
Of course, it’s far more complex than other methods
and can be challenging to implement. The good news is that companies don’t have
to implement synthetic data on their own.
Which Test Data Generation Method Is Best?
The best method for a company will vary based on
its needs, but based on the relative performance of each approach, most
companies will benefit from using synthetic data as their primary test data
generation method. Mage’s approach to synthetic data can be implemented in an
agent or agentless manner, meeting your data where it lives instead of
shoehorning in a solution that slows everything down. And while it maintains the
statistical relationships you need for meaningful analysis, you can also add
noise to its synthetic datasets, allowing you to discover new edge cases and
opportunities, even if they’ve never appeared in your data before.
But that’s not all Mage can do. With both static and dynamic masking, it can protect
user data in production use and at rest. Plus, its role-based access
controls and automation make tracking use simple. Mage is more than
just a tool for solving your test data generation problems—it’s a one-platform
approach to all your data privacy and security needs. Contact us today to see what
Mage can do for your organization.
Source link : https://magedata.ai/whats-the-best-method-for-generating-test-data/