
A Coder's Guide to Data Science

Data Science is an interdisciplinary field that utilizes mathematics, statistics, and computer science to extract meaningful insights from large datasets. It can be used to uncover patterns and solve complex problems in a variety of industries such as healthcare, finance, marketing, and engineering.

 Choosing the right language for a data science project is essential, and there are a variety of languages to choose from. Python, R, SQL, MATLAB, and Scala are some of the best languages for data science, each offering unique features and capabilities that make them suitable for different tasks.

Let's talk about the top five languages in more detail.

The What & When of:

  1. Python
  2. R
  3. SQL
  4. MATLAB
  5. Scala


What is Python?

Python is a high-level, general-purpose programming language that is popular among data scientists for its flexibility, wide range of libraries, and ease of use. Python is used for data analysis, machine learning, web development, and more. It is a great language for beginners as it has a simple syntax and provides a wide range of libraries and modules to help with data manipulation and analysis.

When to choose Python?

Python is a strong choice for projects that involve heavy data manipulation and analysis, and for projects with large or diverse datasets, since its wide range of libraries and modules makes it easier to process and visualize the data. Its simple syntax and abundant learning resources also make it well suited to beginners.
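What that manipulation looks like in practice can be sketched with nothing but the standard library; the dataset and variable names below are purely illustrative (real projects would typically reach for a library like pandas):

```python
from statistics import mean, median

# Hypothetical dataset: monthly revenue figures (illustrative values only)
revenue = [1200, 1450, 980, 1710, 1330, 1600]

avg = mean(revenue)       # arithmetic mean of the series
mid = median(revenue)     # middle value of the sorted series
growth = (revenue[-1] - revenue[0]) / revenue[0]  # relative change, first to last

print(avg, mid, round(growth, 2))
```

The same three lines of analysis scale up naturally once the list is replaced by a real dataset loaded from a file or database.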


What is R?

R is a programming language and software environment for statistical computing and graphics. It is popular among data scientists for its powerful statistical analysis capabilities and its wide range of libraries for data manipulation and visualization. R is particularly popular among academics and researchers, who use it to analyze data and build predictive models.

When to use R?

R shines in projects centered on statistical analysis, and its packages for data manipulation and visualization are equally strong. Because of its popularity among academics and researchers, it is a natural fit for research-oriented projects.


What is SQL?

SQL (Structured Query Language) is a domain-specific language used to interact with databases. It is used to store, retrieve, manipulate, and analyze data stored in a relational database. SQL is popular among data scientists to access and analyze data stored in relational databases, as it is easy to learn and offers powerful features for data analysis.

When to use SQL?

SQL is the obvious choice for projects that involve accessing and analyzing data stored in a relational database, and its query features handle a great deal of data manipulation on their own. It is also easy to learn, which makes it friendly to beginners.
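As a minimal sketch of this kind of analysis, the following uses Python's built-in sqlite3 module against an in-memory database; the table and the figures in it are invented for illustration:

```python
import sqlite3

# In-memory database with a hypothetical sales table (names are illustrative)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100.0), ("south", 250.0), ("north", 150.0), ("south", 50.0)],
)

# Aggregate revenue per region -- the kind of manipulation SQL excels at
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('north', 250.0), ('south', 300.0)]
conn.close()
```

The same GROUP BY query works unchanged against a production-scale relational database; only the connection string differs.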


What is MATLAB?

MATLAB (Matrix Laboratory) is a high-level programming language and environment used for technical computing and data analysis. It is popular among data scientists for its powerful numerical computing and visualization capabilities. MATLAB also has a wide range of libraries for data analysis and machine learning, making it a great choice for data scientists.

When to use MATLAB?

MATLAB suits projects that involve heavy technical computing and visualization. Its numerical computing capabilities are a particular strength, and its toolboxes for data analysis and machine learning cover a wide range of data manipulation tasks.


What is Scala?

Scala is a general-purpose programming language that blends object-oriented and functional programming. It runs on the JVM and is the language in which Apache Spark is written, which makes it a popular choice for large-scale data processing, and its ecosystem offers a solid range of libraries for data manipulation and analysis.

When to use Scala?

Scala suits projects that demand scalability, such as distributed data processing with Apache Spark, as well as projects that benefit from combining object-oriented and functional styles in heavy data manipulation and analysis work.

One Size Doesn't Fit All

In many cases, a hybrid approach works best. This means combining the strengths of different languages and tools into one flexible data science solution. For example, combining Python and R can give you the best of both worlds: Python for data manipulation and visualization, and R for statistical analysis.

No matter what language or tools you use, the most important thing is to choose the right ones for your particular project. Finding the right combination of languages and tools to best suit your project can take some experimentation, but it is well worth the effort.

Author Jane Temov

Jane Temov is an IT Environments Evangelist at Enov8, specializing in IT and Test Environment Management, Release and Data Management product design & solutions.


What Is Data Cloning? A Beginner's Guide

Data Cloning, sometimes called Data Virtualization, is a method of snapshotting real data and creating tiny “fully functional” copies for the purpose of rapid provisioning into your Development & Test Environments.

The Cloning Workflow

There are four primary steps:

  1. Load / Ingest the Source Data
  2. Snapshot the Data
  3. Clone / Replicate the Data
  4. Provision the Data to Dev/Test Environments

Under the Hood

Cloning is typically built on ZFS or Hyper-V technology and lets you move away from traditional backup and restore methods, which can take hours.

By using ZFS or Hyper-V, you can provision databases roughly 100x faster while using roughly a tenth of the storage.

What is ZFS?

  • ZFS is a file system that provides data integrity and snapshotting. It is available on most major operating systems.

What is HyperV?

  • Hyper-V is Microsoft's virtualization platform for creating and managing virtual machines. It also supports snapshotting.

Problem Statement

Backups are often taken manually and can take hours or days to complete. This means that the data isn’t available for use during this time period, which can be problematic if you need access to your data immediately.

There is also a secondary issue: storage. A backup and restore is, by its nature, a 100% copy of the original source. So if you start with a 5 TB database and want three restores, you need another 15 TB of disk space.
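The arithmetic is easy to check. A quick sketch using the figures above (a 5 TB source, three copies) and the roughly 40 MB clone size cited later in this article:

```python
source_tb = 5            # size of the source database, from the example above
copies = 3               # number of environments to provision

# Traditional backup & restore: every copy is a full copy of the source
full_restore_tb = source_tb * copies
print(full_restore_tb)   # 15 TB of extra disk

# Copy-on-write clones share unchanged blocks with the snapshot,
# so each clone starts at roughly 40 MB regardless of source size
clone_total_gb = copies * 40 / 1024
print(round(clone_total_gb, 2))  # ~0.12 GB in total
```

The gap only widens as the source grows, since clone size tracks the changes made, not the size of the original.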

What are the Benefits of Data Cloning?

Data cloning is the process of creating a copy, or snapshot, of data for backup, analysis, or engineering purposes. This can be done in real-time or as part of a scheduled routine. Data clones can be used to provision new databases and test changes to production systems without affecting the live dataset.


– Clones can be used for development and testing without affecting production data

– Clones use little storage, on average about 40 MB, even if the source was 1 TB

– The Snapshot & Cloning process takes seconds, not hours

– You can restore a Clone to any point in time by bookmarking

– Simplifies your End to End Data Management


One drawback: the underlying technology used to achieve cloning can be complex. However, there are various tools on the market that remove this complexity.

What Tools are available to support Data Cloning?

In addition to building your own from scratch, there are a number of commercial cloning solutions on the market.

Each is powerful and has its own set of features and benefits. The key is to understand your data environment and what you’re trying to achieve before making that final decision.

Common Use Cases for Data Cloning

  • DevOps: Data cloning creates an exact copy of a dataset. This is useful for several reasons, such as creating backups or replicating data into test environments for development and testing.
  • Cloud Migration: Data cloning provides a secure and efficient way to move TB-size datasets from on-premises to the cloud. This technology can create space-efficient data environments needed for testing and cutover rehearsal.
  • Platform Upgrades: A large majority of projects run over schedule and budget, primarily because setting up and refreshing project environments is slow and complicated. Data virtualization can cut complexity, lower the total cost of ownership, and accelerate projects by delivering virtual data copies to platform teams more efficiently than legacy processes allow.
  • Analytics: Data clones can provide a space for designing queries and reports, as well as on-demand access to data across sources for BI projects that require data integration. This makes it easier to work with large amounts of data without damaging the original dataset.
  • Production Support: Data cloning can help teams identify and resolve production issues by providing complete virtual data environments. This allows for root cause analysis and validation of changes to ensure that they do not cause further problems.

To Conclude

Data cloning is the process of creating an exact copy of a dataset. This can be useful for many reasons, such as creating backups or replicating data for development and testing purposes. Data clones can be used to quickly provision new databases and test changes to production systems without affecting the live dataset.

This article provides a brief overview of data cloning, including its advantages, disadvantages, common use cases, and available tools. It is intended as a starting point for those who are new to the topic. Further research is recommended to identify the best solution for your specific needs. Thanks for reading!


GDPR Software: 11 Options to Help You Comply in 2022

Businesses today have an ever-growing list of privacy restrictions to deal with when collecting and managing data. One of the most notorious pieces of privacy legislation is the EU’s General Data Protection Regulation (GDPR), which became the law of the land in 2018 and carries stiff penalties for violators.

Suffice it to say that GDPR compliance can be challenging. This is largely due to its size and scope as well as its evolving nature. In order to meet GDPR requirements, many organizations are turning to purpose-built software solutions that are designed to be GDPR-compliant out of the box. 

Without a doubt, this is the fastest and safest way to use data and avoid regulatory complications for businesses that sell to customers who reside in the EU.

What Is the GDPR?

The GDPR is one of the most comprehensive and far-reaching global privacy protocols implemented to date. It replaced the EU's Data Protection Directive and is now the main data privacy law in the EU.

While the GDPR is extensive, it boils down to some basic foundational principles. At a high level, companies that handle data from consumers in the EU need to operate with lawfulness, fairness, and transparency. They also have to limit the data they collect and focus on data minimization, accuracy, integrity and confidentiality, and accountability, among other things. 

The GDPR also grants users eight basic rights over personal data and privacy: the rights to be informed, of access, to rectification, to erasure, to restrict processing, to data portability, to object, and rights related to automated decision-making.

The GDPR applies to all kinds of personal data, ranging from health and biometric data to basic identity information like names, mailing addresses, and email addresses. It also covers any company that collects or processes the personal data of EU residents, regardless of the organization's location or size, although organizations with fewer than 250 employees are exempt from some record-keeping requirements.

Violators of the GDPR may face penalties of up to €20 million (about $23 million) or up to 4 percent of annual worldwide turnover from the previous financial year, whichever is larger.

What Is GDPR Compliance?

When an organization is GDPR-compliant, it means the company meets the law’s various requirements for handling personal data.

The list of requirements is extensive. Some of the most important points involve designating an EU representative, embracing an opt-in mode of data collection, establishing time limits for breach notifications, and responding to customer requests for personal data.

Top GDPR Software Solutions To Consider 

In light of the extensive nature of GDPR, it comes as no surprise that organizations are struggling to comply. According to one study, 85 percent of U.S. companies believe that GDPR compliance regulations put them at a disadvantage against their European competitors.

Even though complying with GDPR is proving to be difficult for global businesses, recent technology advancements make it easier. In fact, there are a variety of GDPR compliance tools on the market that can help streamline workflows and keep you out of trouble. 

1. PrivIQ (Formerly GDPR365)

PrivIQ offers a one-stop shop for GDPR compliance. This platform provides everything you need to know to understand your company’s risks and to manage data privacy. 

Some of PrivIQ’s top features include data mapping, access to privacy notice and governance documents, breach logging support, and graphical dashboard reports. 

2. Onspring

Onspring provides cutting-edge risk management software that simplifies workflows, improves transparency, and helps maintain GDPR compliance. 

This software is excellent for capturing and remediating risks as they appear, including financial, reputational, and third-party threats. The software also makes it possible to control access by user, role, and group.

3. SolarWinds Access Rights Manager

SolarWinds Access Rights Manager (ARM) gives you everything you need to manage access rights across your entire IT environment to ensure GDPR compliance. 

Of note, the GDPR requires detailed user access monitoring. This is especially important for users with sensitive data. ARM can produce custom Active Directory and Azure AD reports, providing instant visibility into what different users can access. 

4. LogicGate Risk Cloud

LogicGate Risk Cloud is a cloud-based platform with prebuilt applications that perform a variety of critical GDPR-related functions. 

For example, the platform automates and centralizes customer requests, investigates breaches, and communicates with supervisory authorities. Additionally, LogicGate ensures that third parties are managing personal data effectively. 

5. Netwrix Auditor

Netwrix Auditor can minimize risk during a data breach. The platform quarantines sensitive data, secures overexposed documents, and manages privilege attestations, among other things.

By using a solution like Netwrix Auditor, your team can promptly discover security threats. If a breach occurs, you can spend less time combing through systems and databases and put more effort into dealing with customers and strategizing on a fix. 

6. OneTrust

OneTrust helps companies enhance their privacy programs. The platform offers prebuilt workflows, templates, automation, and regulatory intelligence to help operationalize data and remain in compliance with GDPR.

On top of this, this platform provides transparency about online tracking and captures consent for tracking technologies, cookies, and marketing communications. It also helps maintain and distribute policies and notices.

7. Vigilant Software Compliance Manager

Vigilant Software Compliance Manager identifies legal and regulatory information security requirements for GDPR. 

Using this software, your company can understand the specific actions that it needs to take to comply with various information security laws. Compliance Manager provides effective dates, direct links to legislation, and implementation requirements. 

8. Boxcryptor

Boxcryptor delivers advanced data protection using state-of-the-art encryption, which is a fundamental part of GDPR. 

With the help of Boxcryptor, your company can ensure that all data receives adequate protection in the cloud. The software encrypts files end to end locally on user devices before they go to cloud storage, enabling strong access control.

9. Didomi

Didomi is a leading privacy and consent management platform. The company offers Didomi for Developers, a comprehensive platform that runs on open APIs and helps integrate customer consent into operations.

Didomi makes it easy to build customer permission into your technology, enabling you to simplify privacy protection and preference management.

Further, the platform provides legal and business teams with real-time consent and preference data for easy compliance reporting. It also enables teams to know when consent is required when collecting new data or using it for different purposes. This in turn reduces risk and lets teams operate with greater confidence.

10. Iubenda

Iubenda makes apps and websites legally compliant across multiple legislations and languages, and GDPR is a main focus. 

With Iubenda, you can access helpful services like a privacy and cookie policy generator, a terms and conditions generator, and a consent solution. The company also offers a cookie solution to manage consent preferences for GDPR and other similar regulations.

11. Enov8 Data Compliance Suite 

Enov8’s Data Compliance Suite uses automated intelligence to identify security exposures and address issues before they lead to major incidents. 

The platform gives IT teams clean production-like data for developing and testing platform changes, eliminating complex and time-consuming manual work. 

Simply put, Enov8 enables teams to work faster and with less risk while eliminating costly remediation efforts and compliance issues. 

GDPR Compliance Can Be a Breeze with Enov8

Achieving GDPR compliance doesn’t have to be a nightmare. With the right software in place, your team can continue developing and using data at a fast pace while avoiding costly fines and penalties. 

To learn how Enov8 can help your organization achieve and maintain GDPR compliance, take the platform for a spin.

Post Author

This post was written by Justin Reynolds. Justin is a freelance writer who enjoys telling stories about how technology, science, and creativity can help workers be more productive. In his spare time, he likes seeing or playing live music, hiking, and traveling.


Supporting Privacy Regulations in Non-Production

Every aspect of our daily lives involves the usage of data. Be it our social media, banking account, or even while using an e-commerce site, we use data everywhere. This data may range from our names and contact information to our banking and credit card details.

The personal data of a user is quite sensitive. In general, all users expect a company to protect their sensitive data. But there is always a slight chance that the app or service you are using might face a data breach. In that case, the question that comes to mind is how the company or app will keep your data safe.

The answer is data privacy regulations. Nowadays, most countries have their individual data privacy laws, and companies operating in those countries generally follow these laws. Data privacy laws protect a customer’s data in production. But did you ever think about whether your dev or testing environment is safe and secure?

In this post, we’ll discuss why you must follow data privacy regulations in a non-production environment. We’ll take a look at the challenges faced while complying with privacy rules, solutions to these challenges, and strategies to follow while implementing privacy laws in non-production. But before that, we’ll discuss a bit about privacy regulations. So, let’s buckle up our seat belts and take a deep dive.

What Do You Mean by Privacy Regulations?

Data privacy regulation, or data compliance, is a set of rules that companies must abide by to ensure they follow all legal procedures when collecting a user's data. Not only that, but it's also the company's job to keep the user's data safe and prevent any misuse.

There are various data privacy laws. For instance, companies operating under the European Union follow GDPR. On the other hand, the United States has several laws like HIPAA, ECPA, and FCRA. Failing to follow these rules results in potential lawsuits or penalties. The goal of these rules is to keep a user’s sensitive data safe and secure from malicious activities.

Now that we know what data privacy regulation is, let’s discuss why we need to follow these rules in non-production.

Why Privacy Regulations in Non-Production Are Important

While deploying an app or a site in production, we add various security protocols. But often, the environment where we develop or test our apps is not that secure. In 2005 and 2006, Walmart faced a security breach when hackers targeted the dev team and transferred sensitive data and source code to somewhere in Eastern Europe.

This kind of incident can happen to any company. Currently, many companies use production data for in-house testing or development. So, how does a company ensure that a user’s sensitive data is safe? The answer is data masking, which is one of the mandatory rules of data privacy regulations.

However, implementing data privacy rules comes with many challenges. Let’s explore some of them and the ways to resolve these challenges.

Challenges Faced While Complying With Privacy Rules

Adapting to something new always comes with certain challenges, be it some new tool, technology, or regulation. Data privacy is no exception. However, the challenges are not that complicated. With proper planning, overcoming them is quite straightforward.

Adapting to New Requirements

Data privacy regulations are generally process-driven. While implementing privacy rules in non-production, your team must welcome changes in the way they do things. This may involve data masking, generating synthetic data, etc. Your team will take some time to adapt to the new processes.

Chalk out a plan before the transition. Train your team and explain why they need to follow these regulations. With proper training and clarification of individual roles, adapting to the new changes won’t take much time.

New Rules of Test Data

If your testing team is using real user data for testing the essential features of your product, beware. The process is going to change. As per data privacy regulations, you cannot use real user data for testing, so the challenge comes while rearranging or recreating your test data.

However, with a proper test data management suite, the task becomes a lot easier than doing the entire thing manually.

Adjusting Your Budget Plan

Implementing any new process often involves spending a lot of money. While implementing privacy laws, you have to think about factors like

  • the research your teams need to do
  • the purchase and implementation of data compliance tools that will help you generate privacy-compliant test data
  • the arrangement of training sessions for your team
  • the hiring of resources to monitor or enforce compliance laws

All of the above and more will affect your budget, so it’s best to have a discussion with your finance and technical team. Figure out the zones where you should focus spending and calculate an approximate amount. Planning is beneficial if you want to avoid overspending. On that note, in the following section, we’ll discuss some strategies to follow while implementing privacy regulations in non-production.

Strategies to Implement Privacy Regulations in Dev and Testing

Although there is no end to planning strategies while implementing data privacy regulations, there are some important steps that we can’t miss.

Sorting Data

Before following privacy laws, you must know everything about your data. If the project is at a starting phase, there will be a lot of customer data. Discuss this with your team to categorize the data and clarify what data is sensitive to the user. Once you categorize the data and separate sensitive data from general data, it’s time for the next steps.

Encrypting Sensitive and Personal Data

GDPR and other data privacy laws make it mandatory for you to secure any sensitive data. Ensure that if you have any such data in a non-production environment, it’s secured by layers of encryption. Even if you’re not using the data, you must still secure it in your database. This is because no matter how strong your firewall is, hackers can always breach it. So it’s wise to protect sensitive data with layers of encryption apart from just a firewall.

Restricting Access to Database

As per most data privacy rules, your database should not grant blanket access to all users. Since a database holds multiple types of data, you must create roles and grant specific permissions to each role. For instance, a tester should have access to test data only, not production data. Imagine a junior developer on your team deleting a table from the production database; it may happen by mistake, but it will cost the company dearly. Enforce these rules to prevent such mishaps.

Change the Policies of Cookies

If you’re developing a site, you’ll need to think about how your cookies work and whether they comply with the data privacy law you’re following. For instance, what if your website is operating outside the EU and the target audience is in the EU? In that case, apart from standard compliance, you need to comply with GDPR as well. As per GDPR, a website should collect a user’s personal data only after they agree to cookie consent. That means you should inform the user about the data used by your site’s cookies to perform specific functions. The information must be clear, and your cookies can collect data only after the user gives permission.

Use of a Compliance Monitoring Solution

Generally, companies appoint a data protection officer (DPO) whose job is to monitor processes, analyze risk, and suggest measures so that your company never fails to comply with privacy laws. But a DPO is only human, and with large data sets it is easy to miss something. The solution? Provide your DPO with a compliance monitoring solution.

Enov8 provides such a solution, built for the needs of compliance managers. The tool monitors your data and identifies risks. It also helps you find compliance breaches and points out processes you need to optimize in order to protect the data.

Disclose Important Information to Users

Data privacy laws ensure that users should have all the knowledge about how companies are using their data. You must disclose everything about data usage while signing the agreements. Situations may arise later for which you may need to revise the agreement. For instance, suppose you’re monitoring the logs of a system that’s connected with the customer’s network. If the logs contain the user’s IP address or other sensitive data, inform the customer.

Synthetic Test Data Generation and Data Masking

There are some cases where you need real data to develop or test something. But what if the data compliance standard that your company follows prohibits you from using real data? Don’t worry. Synthetic data is the next best thing. Synthetic data is data generated by an algorithm and closely imitates the original data. You can also use data masking, where sensitive data is hidden and replaced by similar dummy data. The advantage? You can continue your work without any risk of failing to comply with privacy laws.
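A minimal sketch of both techniques, using only the Python standard library; the masking scheme, field names, and sample record are illustrative assumptions, not a production-grade anonymization design:

```python
import hashlib
import random

def mask_email(email: str) -> str:
    """Pseudonymize an email: keep the domain, hash the local part."""
    local, _, domain = email.partition("@")
    digest = hashlib.sha256(local.encode()).hexdigest()[:8]
    return f"user_{digest}@{domain}"

def synthetic_phone(rng: random.Random) -> str:
    """Generate a plausible but entirely fake phone number."""
    return "555-" + "".join(str(rng.randint(0, 9)) for _ in range(7))

rng = random.Random(42)  # fixed seed so the synthetic data is reproducible
record = {"name": "Jane Doe", "email": "jane.doe@example.com"}
masked = {
    "name": "Customer A",                  # replaced with a generic label
    "email": mask_email(record["email"]),  # pseudonymized, same shape as real data
    "phone": synthetic_phone(rng),         # synthetic value, no real source
}
print(masked["email"])
```

Because the hash is deterministic, the same real email always maps to the same masked value, so joins across masked tables still work, while the synthetic phone number carries no information about any real person.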

Train Your Team on Privacy Regulations

When it comes to complying with privacy laws, there is no end to learning and adapting to new things. It’ll be quite hectic for your team if you enforce a lot of rules on your team all of a sudden. Make the transition smooth by arranging training sessions for your employees to explain the need for compliance with privacy laws and the consequences if they fail to abide by these laws. In addition, train them on using data compliance suites. You can take a look at Enov8’s data compliance suite, which monitors your data and ensures you’re compliant with GDPR, FCRA, ECPA, and multiple other standards.

Keeping your test and dev data compliant with privacy laws may prove to be a little challenging at first. But if planned and executed in a phased manner, your team will adapt easily.


This post was written by Arnab Roy Chowdhury. Arnab is a UI developer by profession and a blogging enthusiast. He has strong expertise in the latest UI/UX trends, project methodologies, testing, and scripting.


SQL Versus NoSQL: What Is the Difference?

SQL vs. NoSQL? Which database architecture should you use for your next project? Which one is the “best”? Some argue that one is always better than the other. But they are very different technologies that solve different problems.

Let’s take a look at them and see where they differ and where they overlap.

SQL Databases

SQL databases support Structured Query Language (SQL), a language for working with data in relational databases. Broadly speaking, “SQL database” and “relational database” refer to the same technology.

A relational database stores data in tables. These tables have columns and rows. The columns define the attributes that each entry in a table can have. Each column has a name and a datatype. The rows are the records in the table.

For example, a table that holds customers might have columns that define the first name, last name, street address, city, state, postal code, and a unique identification code (ID). You could define the first six columns as strings. Or, the postal code could be an integer if all the clients are in the United States. The ID could be a string or an integer.

The relationships between the tables give SQL its power. Suppose you want to track your customer’s vehicles. Add a second table with vehicle ID, brand, model, and type. Then, create a third table that stores two columns: vehicle ID and customer ID. When you add a new vehicle, store its ID with the customer that owns it in this third table. Now, you can query the database for vehicles, for customers, for customers that own certain vehicles, and vehicles owned by customers. You can also easily have more than one vehicle per customer or more than one customer per vehicle.

Three common examples of SQL databases are SQLite, Oracle, and MySQL.

NoSQL Databases

"NoSQL database" means many things. Broadly, NoSQL databases are databases that, well, don't support SQL, or that support only a special dialect of it. Here's a non-exhaustive list of the more popular kinds of NoSQL databases.

Key-Value Databases

Key-Value (KV) databases store data in dictionaries. They can store huge amounts of data for fast insertion and retrieval.

In a KV database, all keys are unique. While the keys are often defined as strings, the values can be any datatype. They can even be different types in the same database. Common examples of values are JSON strings and Binary Large Objects (BLOBs).

Two popular examples of KV databases are Redis and Memcached.
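The core key-value contract can be illustrated with a plain Python dict; the keys and values below are invented, and a real store such as Redis adds persistence, expiry, and networking on top of the same idea:

```python
import json

# A plain dict mimics the key-value contract: unique keys, opaque values
store = {}

# Values can be different types in the same store
store["session:42"] = json.dumps({"user": "jane", "ttl": 3600})  # JSON string
store["counter:hits"] = 1024                                     # integer
store["avatar:42"] = b"\x89PNG..."                               # binary blob

# Retrieval is a single key lookup, which is why KV stores are so fast
session = json.loads(store["session:42"])
print(session["user"], store["counter:hits"])
```

Everything the store knows about a value is reached through its key; there is no query language and no schema, just fast lookups.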

Document Stores

A document store operates like a KV database but contains extra capabilities for manipulating values as documents rather than opaque types.

The structures of the documents in a store are independent of each other. In other words, there is no schema. But, document stores support operations that allow you to query based on the contents.
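A list of dictionaries can stand in for a document collection to show the idea of querying by content; the `find` helper below loosely mimics the shape of a document-store query and is purely illustrative:

```python
# Documents in a collection need not share a schema:
# the second one carries an extra "tags" field
docs = [
    {"_id": 1, "name": "Alice", "city": "Oslo"},
    {"_id": 2, "name": "Bob", "city": "Bergen", "tags": ["vip"]},
    {"_id": 3, "name": "Carol", "city": "Oslo"},
]

def find(collection, **criteria):
    """Query by document contents: keep documents matching every criterion."""
    return [d for d in collection
            if all(d.get(k) == v for k, v in criteria.items())]

oslo = find(docs, city="Oslo")
print([d["name"] for d in oslo])  # ['Alice', 'Carol']
```

A real document store indexes these content queries instead of scanning every document, but the programming model is the same.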

MongoDB and Couchbase are common examples of document stores.
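A minimal sketch of the idea, using a list of Python dicts to stand in for a document collection; the find helper and field names are hypothetical, loosely mimicking a document store's content-based queries:

```python
# Documents with independent structures -- there is no schema.
documents = [
    {"_id": 1, "name": "Alice", "city": "Oslo"},
    {"_id": 2, "name": "Bob", "tags": ["admin"]},           # different shape
    {"_id": 3, "name": "Carol", "city": "Oslo", "age": 33},
]

def find(collection, **criteria):
    """Query by contents: keep documents whose fields match all criteria."""
    return [d for d in collection
            if all(d.get(k) == v for k, v in criteria.items())]

oslo = find(documents, city="Oslo")
print([d["name"] for d in oslo])  # ['Alice', 'Carol']
```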

Column-Oriented Databases

Relational databases use rows to store their data in tables. What sets column-oriented databases apart from them is — as the name suggests — storing their information in columns. These databases support an SQL-like query language, but they store records and relations in columns of the same datatype. This makes for a scalable architecture. Column-oriented databases have very fast insertion and query times. They are suited for huge datasets.

Apache Cassandra and Hadoop HBase are column-oriented databases.

Graph Databases

Graph databases work on the relationships between values. The values are free form, like the values in a document database. But you can connect them with user-defined links. This creates a graph of nodes and the relationships that connect them.

Queries operate on the graph. You can query on the keys, the values, or the relationships between the nodes.

Neo4j and FlockDB are popular graph databases.

SQL vs. NoSQL Databases: Which One?

So, when you compare SQL and NoSQL databases, you’re comparing one database technology with several others. Deciding which one is better depends on your data and how you need to access it.

Your Data Should Guide Your Decision

Is there a perfect fit for every data set? Probably not. But if you look at your data and how you use it, the best database becomes apparent.

Relational Data Problems

Can you break your data down into entities with logical relationships? A relational database is what you need, especially when you need to perform operations with the relationships.

Relational databases are best when you need data integrity. Properly designed, the constraints that relational databases place on datatypes and relations help guarantee integrity. NoSQL databases tend to be designed without explicit support for constraints, placing the onus on you.

Caching Data Problems

Caching is storing data for repeated access. You usually identify cached data with a single key. NoSQL databases excel at solving caching problems, while relational databases tend to be overkill.

Key-Value stores are an obvious choice for caching problems. Many websites use Redis and Memcached for data and session information.

But a document store that saves documents for historical purposes or reuse is an example of a caching solution, too.
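The caching pattern boils down to a get-or-compute lookup on a single key. A few lines of Python sketch it; here a dict stands in for Redis or Memcached, and the function names are illustrative:

```python
# The dict plays the role of the cache (Redis or Memcached in production).
cache = {}

def expensive_lookup(user_id):
    """Stands in for a slow database query or computation."""
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    key = f"user:{user_id}"        # a single key identifies the cached data
    if key not in cache:           # miss: compute and store
        cache[key] = expensive_lookup(user_id)
    return cache[key]              # hit: served straight from the cache

get_user(7)         # first call populates the cache
user = get_user(7)  # second call is a cache hit
print(user["name"])
```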

Graph Data Problems

If a graph database stores data with relationships between data, why isn’t it a relational database? It’s because in a graph database relationships are just as important as the data. The relations have fields, names, and directions. Graph queries may include relationships and their names, types, or fields. Relation queries also use wildcards, which account for indirect relationships.

Suppose a database represents rooms in several hosting facilities. It stores buildings, rooms, racks, computers, and networking equipment. This is a relational problem since you have entities with specific relationships.

There could be a table for each entity in a relational database and then join tables representing the relationships between them. But now imagine a query for all the networking equipment in a given building. It has to look in the buildings, find the rooms, look in the rooms for racks, and finally collect all the equipment.

In a graph database, you could create a relation called “contains.” It would be a one-way relation reflecting that one node contains another. Each item in each facility is a node contained by another, except for the buildings. When you query the database for networking gear, a wildcard could combine relationships between the buildings, rooms, and racks. This query models real life, since you say “Give me all of the gear in building X.”
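A rough Python sketch of the “contains” traversal described above; the node names and the adjacency-map representation are illustrative, not how a graph database like Neo4j actually stores data:

```python
# One-way "contains" relation as an adjacency map of node -> contained nodes.
contains = {
    "building-X": ["room-1"],
    "room-1": ["rack-A"],
    "rack-A": ["switch-7", "server-3"],
}
# What kind of equipment each leaf node is.
kind = {"switch-7": "network", "server-3": "compute"}

def gear_in(node):
    """Follow 'contains' edges to any depth -- the wildcard-style query."""
    found = []
    for child in contains.get(node, []):
        if kind.get(child) == "network":
            found.append(child)
        found.extend(gear_in(child))
    return found

result = gear_in("building-X")
print(result)  # ['switch-7']
```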

Scalability: SQL vs. NoSQL

Which technology scales better? NoSQL may have a slight edge here.

Relational databases scale vertically. In other words, data can’t extend across different servers. So, for large datasets, you need a bigger server. As your data increases in size, you need more drive space and more memory. You can share the load across clusters, but not data.

Column-oriented databases were created to solve this problem. They provide horizontal scalability with a relational model.

Key-Value, document, and graph databases also scale horizontally since it’s easier to distribute their datasets across a cluster of servers.

SQL vs. NoSQL: Which One?

SQL and NoSQL are effective technologies. SQL has been around for decades and has proven its worth in countless applications. NoSQL is a set of technologies that solve a variety of different problems. Each of them has its own advantages and tradeoffs.

The question is, which one is best suited for your application? Take the first step by carefully modeling your data and defining use-cases to learn how you need to store and retrieve it. Then, pick the right technology for your application.

Author – Eric Goebelbecker

Eric has worked in the financial markets in New York City for 25 years, developing infrastructure for market data and financial information exchange (FIX) protocol networks. He loves to talk about what makes teams effective (or not so effective!).

Test Data Subsetting

What is Data Subsetting in TDM

The foundation of a comprehensive and well-implemented QA strategy is a sound testing approach. And a sound testing approach, in its turn, depends on having a proper test data management (TDM) process in place. TDM’s responsibilities include obtaining high-quality and high-volume test data in a timely manner. However, obtaining such data isn’t an easy process and might create prohibitive infrastructure costs and unforeseen challenges.

This post is all about the solution: data subsetting.

We begin by defining data subsetting in a more general sense and then explain why it’s so important in the context of TDM. We then talk about the main challenges involved in data subsetting and cover the main methods in which it can be performed.

Let’s get started.

What Is Data Subsetting?

Data subsetting isn’t a hard concept to grasp. To put it simply, it consists of getting a subset or a slice of a complete dataset and moving it somewhere else.

The next step is to understand how this concept works in the context of TDM.

Why Is Data Subsetting Needed in TDM? Knowing the Pain

Data subsetting is a medicine that’s supposed to alleviate a specific pain. So if you want to understand what subsetting is and why it’s needed in the context of TDM, you first need to understand the pain we’re talking about.

A couple of sections ago, we defined test data management. You learned that this process is in charge of coming up with reliable data for the consumption of automated test cases. And even though there are alternatives, one of the most popular solutions for this problem is merely copying the data from the production servers, which is often called production cloning.

Copying from production is a handy way of obtaining realistic data sets for testing, since nothing is more realistic than production data itself. However, this approach presents some severe downsides. The security-related challenges can be solved with approaches like data masking. This post focuses on the challenges related to infrastructure.

The Pains of Production Cloning: The Infrastructure Edition

You could probably summarize the infrastructure-related challenges of production cloning in two words: high costs. If you want to copy 100 percent of your production data into your test environments, you’ll incur incredibly high costs for storage and infrastructure.

That’s not to mention the fact that you could potentially have not only one test environment but several. So we’re talking about multiplying this astronomical cost three, four, or even five times.

Besides the direct financial hit, you’d also incur indirect costs in the form of slow test suites. If you have gigantic amounts of test data, then it’ll necessarily take a long time for you to load it when it’s time for test execution.

Data Subsetting to the Rescue

Applying data subsetting in the context of TDM can solve or alleviate the difficulties of copying data from production. When you create test data by copying not the whole production database but a relatively small portion of it, you don’t incur the exorbitant infrastructure costs that you would otherwise. In other words, it’s like a partial production cloning.

What are the main benefits of using data subsetting in TDM?

The first distinct advantage is the decrease in storage costs for the test data. In the same vein, you’ll also incur fewer costs in overall infrastructure. These cost savings quickly add up if you factor in multiple QA or testing environments, which you most likely have.

But the benefits of data subsetting aren’t all about cost. Test time is also impacted positively. Since there’s less data to load and refresh, it takes less time to do it. That way, the adoption of data subsetting can also reduce the total execution time of your test suites.

The Challenges Involved in Data Subsetting

There isn’t such a thing as medicine without side effects. So data subsetting, despite being able to cure or alleviate some pains in the TDM process, also comes with some pains of its own. Let’s look at them.

The first roadblock is referential integrity. Let’s say you work for a social network site and you’re implementing data subsetting. The site has one million users, and you’ve got just a hundred thousand of them for your test database, slicing from the users table. When you’re getting data from the other tables in the database, you have to make sure to fetch the posts, friendships, and pictures from just those hundred thousand users to keep the existing foreign-key relationships intact.

This roadblock becomes even more relentless when you factor in the possibility of relationships spanning multiple databases. There’s nothing that prohibits this organization from storing user profiles in a PostgreSQL database and posts in an Oracle database.

These types of relationships can be even harder to track and protect when they span not only multiple databases but multiple data sources. You could use relational databases for some types of data while others reside in .csv files, and a third type might be stored in some document-based NoSQL database. Such a variety of possible data sources certainly poses a challenge for maintaining referential integrity when doing data subsetting.

Data Subsetting Methods

Let’s now cover the main methods in which you can implement data subsetting.

Using SQL Queries

We start with the most straightforward approach. In the case of data subsetting, this translates to using plain old SQL queries.

For instance, in the social network example we used before, let’s say you have a total of ten thousand users (it’s a small social network, mind you) and you want to get a hundred users. This is how you’d do it, for instance, in PostgreSQL:

SELECT * FROM users ORDER BY id LIMIT 100;
Now let’s say you want to fetch the posts from those hundred users. How would you do it? Here’s a solution:

SELECT * FROM posts WHERE user_id IN (SELECT id FROM users ORDER BY id LIMIT 100);

These are just simple examples. For a more realistic approach, you would have more complex queries stored in script files. Ideally, you’d check those into version control to track any changes.
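These queries can be exercised end to end with Python’s built-in sqlite3 module, with SQLite standing in for PostgreSQL; a minimal sketch with illustrative users and posts tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY);
CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER REFERENCES users(id));
""")
# A thousand users with five posts each (tiny stand-in for production data).
cur.executemany("INSERT INTO users VALUES (?)", [(i,) for i in range(1, 1001)])
cur.executemany("INSERT INTO posts VALUES (?, ?)",
                [(i, (i % 1000) + 1) for i in range(1, 5001)])

# Subset the root table, then pull only the posts of those users so that
# the foreign-key relationships in the subset stay intact.
cur.execute("SELECT COUNT(*) FROM posts "
            "WHERE user_id IN (SELECT id FROM users ORDER BY id LIMIT 100)")
subset_posts = cur.fetchone()[0]
print(subset_posts)  # 500 -- five posts for each of the hundred users
```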

The advantage of this approach is that it’s easy to get started; that’s pretty much it. You most likely have people with SQL knowledge in your organization, so the learning curve here should be nonexistent.

On the other hand, this approach is very limited. While it might work well in the beginning, as soon as your subsetting needs start to get serious, it falls apart. It becomes increasingly harder to understand, change, and maintain the scripts. Bear in mind that changes to the “root” query (in our example, the query that fetches the hundred users) cascade down to its children, which further complicates updating the scripts.

Also, knowledge of query performance optimization techniques might be necessary. Otherwise, you might get into situations where poorly written queries become unusable due to outrageously poor performance.

Developing a Custom Solution

The second approach on our list consists of developing a custom application to perform the subsetting. The programming stack you use for creating this application is of little consequence. So just adopt the programming languages and frameworks the software engineers in the organization are already comfortable using.

This approach can be effective for small and medium-size teams, especially when it comes to cost. But at the end of the day, it has more downsides than benefits.

First, building such a tool takes development time that could be used elsewhere, so you’re incurring an opportunity cost here. Database knowledge would also be necessary for building the tool, which would partially defeat the purpose of having this abstraction layer above the database in the first place.

Also, the opportunity cost of building the tool isn’t the only one you would incur. After the solution is ready, you’d have to maintain it for as long as it’s in use.

Adopting Commercial Tools

Why build when you can buy? That’s the reasoning behind the third approach on our list: adopting an existing commercial tool.

There are plenty of benefits in adopting a commercial tool, among which the most important are probably the robustness and scalability of the tool and the high number of databases and data sources it supports.

The downside associated with buying isn’t that surprising, and it boils down to one thing: high costs. These costs aren’t just what you pay when buying or subscribing to the tool. You have to factor in the total cost of ownership, including the learning curve. And that might be steeper than you expect.

Using Open Source Tools

Why buy when you can get it for free, besides having access to the source code?

Adopting open source subsetting tools is like getting the best of both worlds: you don’t have to build a tool from scratch, while at the same time you can see the source code and change it if you ever need to expand the capabilities.

The downside is the total cost of ownership, which might still be high, depending on the learning curve the tool presents.

TDM: Subset Your Way to Success!

Copying data from production to use in tests is an old technique. It’s a handy way of feeding test cases with realistic data. However, it’s not without its problems. Besides security and privacy concerns, production cloning can also generate substantial infrastructure costs and high testing times.

In order to solve those problems, many organizations adopt data subsetting. Using subsetting and techniques that deal with security and privacy concerns (e.g., data masking) allows companies to leverage production cloning safely and affordably.

Thanks for reading, and until next time.


This post was written by Carlos Schults. Carlos is a .NET software developer with experience in both desktop and web development, and he’s now trying his hand at mobile. He has a passion for writing clean and concise code, and he’s interested in practices that help you improve app health, such as code review, automated testing, and continuous build.


What is the Consumer Data Right (CDR)?

A DataOps Article.

What is the Consumer Data Right (CDR)?

You may have heard it mentioned, particularly if you’re in “Open Banking”. But CDR is the future of how we access and ultimately share our data with “trusted” third parties.

It will be introduced into the Australian banking sector initially from the middle of 2020, with scope and functionality evolving in phases, and will ultimately roll out across other sectors of the economy, including superannuation, energy, and telecommunications.

Vendor Benefits

The Consumer Data Right is first and foremost a competition and consumer reform!

  • Reduced sector “monopolization” (increased competition).
  • CDR encourages innovation and competition between service providers.
  • Access to new digital products and channels.
  • New, yet-to-be-innovated customer experiences.

Consumer Benefits

  • Immediate access to your information for quicker decision making.
  • Better transparency of vendor(s) pricing and offers.
  • Increase in products to support your lifestyle.
  • Consumer power e.g. ease of switching when dissatisfied with providers.

Vendor Risks

  • CDR compliance is mandatory for Data Holders.
  • Implementing CDR (on top of legacy platforms) is non-trivial.
  • Non-compliance penalties may be severe (fines and trading restrictions).
  • CDR is rapidly evolving and continually changing; continuous conformance validation and upkeep are required.
  • Increased access to data means an increased “attack footprint”.

Be warned! Although the CDR is expected to create exciting new opportunities, there are also clearly defined conformance requirements. In a nutshell, breaches of the CDR Rules can attract severe penalties ranging from $10M to 10% of the organization’s annual revenue.

Who is responsible for CDR?

Ultimately CDR may evolve to a point where it is self-regulating. However, at present at least, the accreditation of who can be part of the ecosystem (i.e. Data Holders & Data Recipients) will be controlled by the relevant industry regulators*.

*In Australia the ACCC is responsible for implementing the CDR system. Only an organisation which has been accredited can provide services under the CDR system. An accredited provider must comply with a set of privacy safeguards, rules, and IT system requirements that ensure your privacy is protected and your data is transferred and managed securely.

How do consumers keep their data safe?

The CDR system is designed to ensure your data is only made available to service providers after you have authenticated and given consent.

Note: The diagram below, based on the Australian OAuth2/OIDC CDR security guidelines, shows the key interactions between the Consumer, the Data Recipient (e.g. a Retailer App on a Phone), and a Data Holder (a Bank).

Australian CDR uses OAuth2/OIDC Hybrid Flow

Consumers can control what data is shared, what it can be used for, and for how long. Consumers will also have the ability to revoke consent and have their information deleted at any time.

CDR is the beginning of an interesting new information era. Learn more about the Consumer Data Right and accreditation on the CDR website.


Smells that indicate that you need TDM

You may have heard of code smells or even smelly test environments.

But, what about Data Smells?

In this post we discuss top smells that indicate you need Test Data Management.


In computer programming, a code smell is any characteristic in the source code of a program that possibly indicates a deeper problem. Determining what is and is not a code smell is subjective, and varies by language, developer, and development methodology.

The term was popularised by Kent Beck on WardsWiki in the late 1990s. (Source: Wikipedia.)

Invariably, every IT problem has symptoms, which we call smells.

In this post, let’s focus on the most popular ones associated with Test Data Management.

Top 15 TDM Smells

  1. Testers waste large amounts of time creating data rather than testing the application.
  2. Data provided doesn’t meet the requirements for testing (has an incorrect mix of data).
  3. DevTest cycle and/or project slippage due to data unreadiness.
  4. Dependency on other experts to provide the Test Data (experts that may have other priorities).
  5. System Data lacks integrity (is incomplete) and limits System Testing.
  6. Upstream or downstream data hasn’t been prepared in a similar fashion, causing E2E integrity issues.
  7. Data-related defects caused by data being in an unrealistic state, i.e. false positives.
  8. Test Data creation is (or is deemed) too complicated or time consuming.
  9. Test Data is too large, causing refreshes to take too long.
  10. Test Data size (production size) causes performance bottlenecks and broken batch processing in smaller test environments.
  11. Data has been copied from production and contains PII, i.e. the data is insecure.
  12. Testers (and developers) don’t understand what the platform data looks like, resulting in engineers fumbling in the dark as they try to exercise it effectively.
  13. Due to data complexity, testers can’t easily find (mine) the data sets once they have been deployed.
  14. Test results are being corrupted by testers “only” using/reusing the same small data sets (data contention).
  15. Lack of data reuse (or automation), resulting in continuous reinvention and repeated mistakes.

In Conclusion

Test Data is an essential, if somewhat complicated, and often ignored aspect of effective DevOps and quality engineering. However, treating it as an afterthought will invariably result in the smells described above: smells that will introduce suboptimal DevOps/DataOps operations, unwanted project delays, and poor testing.

Do you have other ideas on Test Data Management smells? If so, please let us know.

Post By Jane Temov

Jane is an experienced IT Environments Management & Resilience Evangelist. Areas of specialism include IT & Test Environment Management, Disaster Recovery, Release Management, Service Resilience, Configuration Management, DevOps & Infra/Cloud Migration.


What Is a Data Analytics Internal Audit & How to Prepare?

No one wants to deal with a data audit. You haven’t invested so much time into putting everything together just to have someone else come in and start raising questions. Hopefully, an audit will never happen to you. Still, the possibility that an audit could happen tomorrow is there, and this post is about what a data analytics internal audit is and how to prepare for it.


How to Organize a Test Data Management Team

So, you’ve recently learned about what Test Data Management is and why it’s amazingly valuable. Then, you’ve decided to start a TDM process at your organization. You’ve read about what Data Management includes, learned how TDM works, and finally went on to start implementing your Test Data Management strategy. But then you got stuck, right at the start. You’ve got a question for which you don’t have an answer: how to organize a Test Data Management team?

Well, fear no more, because that’s precisely what today’s post is about.

We start with a brief overview of Test Data Management itself. Feel free to skip, though, if you’re already familiar with the concept. We won’t judge you for that; we’re just that nice.
