Test More Efficiently when Using Real Data

Most projects reach a point in testing where a copy of live data is run through the test system to see how the system copes with real data. Typically this happens at the end of functional testing, when the system is believed to be working as designed. The aim is not performance testing; it is to find defects that functional testing missed and that only become apparent when processing live data. This is a valid and useful exercise, though it is not straightforward to carry out effectively. Using real data also raises data privacy considerations, which we return to later.

The live data test approach is usually one of:

a. Select a sample of live data to process. The problem with this approach is ensuring that the sample is representative of all the variations present in the live data; a sample may still miss some key attribute.

b. Process a copy of all the live data. If the data set is huge, the test environment may not have the physical resources to handle it. Even if a large data set can be processed, the approach is inefficient, and when an error is encountered the sheer volume of data can make it harder to identify exactly which aspect of the data caused it.

A much more efficient and effective approach is to use a copy of the complete live data set as the source, so that no data content is excluded, but instead of sampling it, to select just the significant variations of each key attribute. The result is a small data set that still tests every significant variation of the source data.

Data Profiler and Test Data Selector

SQA’s Data Profiler tool provides a comprehensive data analysis of a file of data, reporting on all the different variations of data present in all the fields of all the records in the file. Primarily aimed at identifying data quality issues within customer data, the Data Profiler includes a Test Data Selector which can be used to create a test file of selected data from the profiled data. The Data Profiler can easily profile a file of millions of records in a few minutes, so there is no need to sample the data: the Data Profiler reads the complete data set.

An example best explains what the Data Profiler can do. It can, for instance, identify all the different data lengths present in a field. The example below profiles a surname field, showing how many records have each surname length and a sample value for each length. Here the Profiler identified 17,743 records with surnames 7 characters long, and the Test Data Selector can select just one of those records to create a test record with a 7-character surname. It will do the same for every other surname length, and can do likewise for every other field in the data set. The Test Data Selector can therefore create a test file with one record for every different field length in the data set: a file that comprehensively tests the field length variations present in the live data, using just one record per variant.

[Image 1: surname length profile, showing the record count and a sample value for each surname length]
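The selection idea can be sketched in a few lines of Python. This is only an illustration of the principle, not the Data Profiler's actual implementation; the function names, the record layout (a list of dictionaries) and the sample surnames are all assumptions made for the example.

```python
def profile_lengths(records, field):
    """Return {length: (count, sample_record)} for one field.

    The sample is simply the first record seen with that length.
    """
    profile = {}
    for rec in records:
        n = len(rec[field])
        count, sample = profile.get(n, (0, rec))
        profile[n] = (count + 1, sample)
    return profile


def select_one_per_length(records, field):
    """Build a test set holding one record for each distinct field length."""
    profile = profile_lengths(records, field)
    return [sample for _, (_, sample) in sorted(profile.items())]


# Hypothetical source data; a real run would read the complete live file.
records = [
    {"surname": "SMITH"},
    {"surname": "JONES"},
    {"surname": "HYDE-SMITH"},
    {"surname": "LEE"},
    {"surname": "KHAN"},
    {"surname": "WOOD"},
]

test_file = select_one_per_length(records, "surname")
for rec in test_file:
    print(len(rec["surname"]), rec["surname"])
```

Six source records reduce to four test records here, one per surname length, and the same reduction applies however many source records share each length.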

This is a very simple example of how the Test Data Selector creates an efficient and comprehensive test file. A wealth of refinements to the data selection is available, such as:

  • Select every different pattern of data from a field: e.g. x x, x-x, xx/x, xx.x and so on.
  • Select each different character present in a field: alpha, numeric, punctuation, special characters etc.
  • Select one of every specific value for a field: perhaps from a title field, ensuring one of every possible title is represented in the test file.
  • Specific profiling configurations allow data analysis tuned for different data types, from Names, Addresses, Countries, through to Dates, Currencies, Bank Sort Codes and more.
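As a rough sketch of the first refinement, the snippet below reduces each value to a pattern and keeps one value per distinct pattern. The mask notation used here (letters become "a", digits become "9", everything else is kept as is) is an assumption for illustration and is not necessarily the notation the Test Data Selector uses; the sort-code-like values are invented.

```python
def mask(value):
    """Reduce a value to its pattern: letters -> 'a', digits -> '9'."""
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("a")
        elif ch.isdigit():
            out.append("9")
        else:
            out.append(ch)  # punctuation and spaces define the pattern
    return "".join(out)


def select_one_per_pattern(values):
    """Return {pattern: first value seen with that pattern}."""
    selected = {}
    for value in values:
        selected.setdefault(mask(value), value)
    return selected


# Hypothetical field values with three different layouts.
values = ["12-34-56", "40-47-11", "123456", "12 34 56", "654321"]
patterns = select_one_per_pattern(values)
for pattern, sample in patterns.items():
    print(pattern, "->", sample)
```

Five values collapse to three patterns, so three test records are enough to exercise every layout present in this field.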

The Test Data Selector is highly customisable, enabling the user to configure exactly which data variations are significant when constructing a test file.

When the Test Data Selector creates a test file, it also creates a separate control file that describes the specific reason each record was selected for inclusion in the test file. You are therefore no longer using a random sample of live data for testing, but a well-defined test file with a control sheet that describes exactly which data variant each record is designed to test.
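The pairing of a test file with a control file might look something like the sketch below. The CSV layout, the column names and the use of field length as the only selection criterion are assumptions for illustration; the real control file format is not described here.

```python
import csv
import io


def select_with_reasons(records, fields):
    """Select each record that adds a new field-length variant.

    Returns a list of (record, reasons) pairs, where reasons lists
    the variants that the record is the first to cover.
    """
    seen = set()
    selected = []
    for rec in records:
        reasons = []
        for field in fields:
            key = (field, len(rec[field]))
            if key not in seen:
                seen.add(key)
                reasons.append(f"{field} length {len(rec[field])}")
        if reasons:
            selected.append((rec, reasons))
    return selected


def write_control_file(selected, out):
    """Write one control row per test record, stating why it was chosen."""
    writer = csv.writer(out)
    writer.writerow(["test_record", "reason"])
    for number, (_, reasons) in enumerate(selected, start=1):
        writer.writerow([number, "; ".join(reasons)])


# Hypothetical source records.
records = [
    {"surname": "SMITH", "title": "MR"},
    {"surname": "JONES", "title": "MRS"},
    {"surname": "LEE", "title": "DR"},
    {"surname": "BROWN", "title": "MS"},  # adds no new variant, so skipped
]

selected = select_with_reasons(records, ["surname", "title"])
control = io.StringIO()
write_control_file(selected, control)
print(control.getvalue())
```

Each row of the control output ties a test record number to the variants it covers, which is what turns the test file from a sample into a documented set of test cases.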

Data Obfuscation

When using live data, data privacy concerns must be addressed. Test Data Selector has an obfuscation feature that can be configured separately for each field. Obfuscation completely randomises the source data while retaining the fundamental data structure for testing purposes. For example, obfuscating a personal name might change JON HYDE-SMITH to ZAJ QKFI-HGOZX, so the name structure of consonants, vowels and punctuation is retained but otherwise the original customer name has been completely obfuscated. Obfuscation can be applied to any data but is most applicable for customer name, address, and account data.
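A structure-preserving obfuscation of this kind can be sketched as follows. This is an illustrative approximation under stated assumptions, not the Test Data Selector's algorithm: it treats Y as a consonant (consistent with the example above), handles only unaccented Latin letters, and uses Python's general-purpose `random` module, which is not designed for security-sensitive use.

```python
import random
import string

VOWELS = "AEIOU"
CONSONANTS = "".join(c for c in string.ascii_uppercase if c not in VOWELS)


def obfuscate(value, rng=random):
    """Randomise a value while keeping its structure.

    Vowels become random vowels, consonants random consonants and
    digits random digits; case, spaces and punctuation are kept.
    """
    out = []
    for ch in value:
        upper = ch.upper()
        if upper in VOWELS:
            new = rng.choice(VOWELS)
        elif upper in CONSONANTS:
            new = rng.choice(CONSONANTS)
        elif ch.isdigit():
            new = rng.choice(string.digits)
        else:
            out.append(ch)  # punctuation, spaces and other characters
            continue
        out.append(new if ch.isupper() else new.lower())
    return "".join(out)


print(obfuscate("JON HYDE-SMITH"))
```

Because each replacement is drawn at random, a character can occasionally map to itself; a stricter version could exclude the original character from the choice.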

Data Aging

To represent the movement of data over time, Test Data Selector can use data aging to move dates forwards or backwards in time by altering the day, month, or year, or applying a random aging factor.
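Date aging of this sort could be sketched as below, using only the Python standard library. The function names and the rule for clamping the day when the target month is shorter (for example, 31 January plus one month becomes the last day of February) are assumptions made for the example; the tool's own behaviour in such edge cases is not stated here.

```python
import calendar
import random
from datetime import date, timedelta


def age_date(d, years=0, months=0, days=0):
    """Shift a date by whole years, months and days (negative moves back).

    If the target month is shorter, the day is clamped to its last day.
    """
    total_months = d.year * 12 + (d.month - 1) + years * 12 + months
    year, month_index = divmod(total_months, 12)
    month = month_index + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day) + timedelta(days=days)


def age_randomly(d, max_days, rng=random):
    """Shift a date by a random number of days within +/- max_days."""
    return d + timedelta(days=rng.randint(-max_days, max_days))


print(age_date(date(2024, 2, 29), years=1))
print(age_date(date(2023, 3, 15), months=-3, days=10))
print(age_randomly(date(2023, 6, 1), max_days=30))
```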

Conclusion: make the best use of your own data to test an application

The above is a brief introduction to using the SQA Test Data Selector to create powerful test files based on real data, while giving due consideration to data privacy concerns when using copies of live data.

If you would like to learn more about this powerful testing tool, and how SQA Consulting may assist you with such needs, please contact us.
