By Richard Blech

Securing Data Lakes

There are multiple use cases for data lakes. Many organizations create these systems for archival or reference purposes, to comply with government data regulations, or as a tool for gaining insight from Big Data. Regardless of its purpose, securing the data lake remains a constant concern.

Cloud-based data lake vendors like Microsoft, Oracle, Amazon and IBM provide some data security measures, including encryption. For organizations that build their on-premises data lakes from open-source components like Hadoop, Storm or Spark, third-party data protection solutions are necessary to ensure that the data stored in the data lake, and the data traveling to and from it, is protected.


Big Data projects, in particular, have contributed to the increased use of data lakes. Many organizations have had to address the problem of how to leverage the vast amount of disparate data they collect so that it creates value for the organization. For example, an organization whose website has been in operation for a number of years and has accumulated petabytes or even zettabytes of variable data in its database can encounter problems. As the volume of data increases, the analysis and reporting of that data get progressively slower. The data lake is one of the solutions meant to make it easier to gain insight from vast amounts of data.


In addition to storing structured data, data lakes can also house unstructured and semi-structured data, all in a near-exact or exact duplicate of the source format. This provides more flexibility than a data warehouse, which requires preprocessing of any data that is deposited. With data lakes, users can explore the data, create their own queries, perform analytics and build predictive models. Using data lakes for diverse, original data increases the possibility of finding valuable patterns and insights.

The data lake is also a prime cybersecurity concern. Data placed in a data lake can be more exposed to threats than it was in the systems from which it was copied. Consider what a data lake means from a data security perspective: with a single breach, malicious actors can gain access to some or all of an organization’s most valuable data.


Any approach to securing data lakes has to begin with a comprehensive understanding of the data. This means knowing how it is being used and which applications have to access it. Governance and access control are applied with the right policies and tools working in tandem with one another:

  • Validation of end users using multifactor authentication. Multifactor authentication requires users to present two or more independent proofs of identity, so a single stolen credential is not enough to reach the data.

  • Authorization to specific data. One of the issues with regulating data access in data lakes, and one that can complicate securing the data, is that file objects typically hold enormous amounts of data with multiple varying properties. A user may be authorized for one dataset in an object but not for any others. One way around this is to use solutions that provide granular permissions and authorizations.

  • Auditing of users’ actions. Users’ actions have to be carefully monitored, with records created detailing who is using which data in the data lake and how much of it. Those without the proper permissions should be unable to access data resources.

  • Data encryption. Data has to be protected when it is in transit and at rest. Without encryption, malicious actors can use tools to circumvent access controls and gain direct access to the data.
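The dataset-level authorization problem described above can be sketched in a few lines. This is a hypothetical illustration (the names `LakeACL`, `grant` and `can_read` are invented here, not any vendor's API): access is denied by default, and a grant applies to one dataset inside one file object rather than to the object as a whole.

```python
from dataclasses import dataclass, field

@dataclass
class LakeACL:
    # maps (user, object_name) -> set of dataset names the user may read
    grants: dict = field(default_factory=dict)

    def grant(self, user: str, obj: str, dataset: str) -> None:
        self.grants.setdefault((user, obj), set()).add(dataset)

    def can_read(self, user: str, obj: str, dataset: str) -> bool:
        # Deny by default: access requires an explicit dataset-level grant.
        return dataset in self.grants.get((user, obj), set())

acl = LakeACL()
# analyst1 may read one dataset in the object, but nothing else in it.
acl.grant("analyst1", "clickstream-2021.parquet", "web_sessions")

print(acl.can_read("analyst1", "clickstream-2021.parquet", "web_sessions"))  # True
print(acl.can_read("analyst1", "clickstream-2021.parquet", "billing"))       # False
```

Pairing checks like this with the audit log from the previous bullet gives both enforcement and a record of who touched which dataset.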


Any technologies used to secure data lakes should be able to facilitate agile and responsible access to the data. In the matter of encryption for data lakes, there are certain factors to keep in mind:

  • The sheer volume of data being stored or transmitted

  • The size of the file objects

  • The wide range of data formats in a single file object

  • The strength of the encryption (is it quantum-safe?)

The encryption solution has to be able to protect vast amounts of data without hindering performance or contributing to latency.

Let’s take a look at one cloud-based data lake vendor, AWS. For data in transit, AWS S3 uses the TLS protocol to encrypt traffic between an application and the AWS service. For data at rest, it uses AWS KMS (with customer master keys and data keys).
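On the transit side, the baseline protection is straightforward to enforce from the client. A minimal sketch using Python's standard `ssl` module follows; the settings shown are generic TLS hygiene for any HTTPS endpoint, not AWS-specific configuration.

```python
import ssl

# Build a client-side TLS context that verifies the server certificate
# and refuses downgraded protocol versions.
ctx = ssl.create_default_context()            # cert verification + hostname checks on
ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # reject TLS 1.0/1.1 connections

print(ctx.verify_mode == ssl.CERT_REQUIRED)   # True
print(ctx.check_hostname)                     # True
```

Certificate verification and hostname checking are what defeat a trivial MITM attempt; disabling either in application code silently removes that protection.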

The TLS protocol cannot be fully relied upon to protect against cyberattacks that target data in transit, such as man-in-the-middle (MITM) attacks. And an AWS KMS customer master key (CMK) provides only 256-bit protection and can directly encrypt payloads no larger than 4KB; anything larger must go through data keys.
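The 4KB ceiling is less restrictive than it sounds because KMS uses envelope encryption: the master key encrypts only a small per-object data key, and that data key encrypts a payload of any size. Below is a stdlib-only sketch of the pattern; the HMAC-SHA256 counter-mode keystream is a stand-in for a real cipher such as AES-GCM and is for illustration only, never production use.

```python
import hmac, hashlib, os

def keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    # Demo-only stream cipher: HMAC-SHA256 in counter mode as the keystream.
    # XOR-ing twice with the same key and nonce recovers the input.
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        block = hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256).digest()
        out.extend(block)
        counter += 1
    return bytes(x ^ y for x, y in zip(data, out))

master_key = os.urandom(32)  # in real life this key never leaves the KMS/HSM

# 1. Generate a fresh data key per object and wrap it under the master key.
data_key = os.urandom(32)
wrap_nonce = os.urandom(16)
wrapped_key = keystream_xor(master_key, wrap_nonce, data_key)  # tiny: well under 4KB

# 2. Encrypt the (arbitrarily large) object under the data key.
obj_nonce = os.urandom(16)
ciphertext = keystream_xor(data_key, obj_nonce, b"petabyte-scale payload...")

# 3. To decrypt: unwrap the data key first, then decrypt the object.
recovered_key = keystream_xor(master_key, wrap_nonce, wrapped_key)
plaintext = keystream_xor(recovered_key, obj_nonce, ciphertext)
print(plaintext)  # b'petabyte-scale payload...'
```

The master key thus only ever touches 32-byte data keys, which is why the per-call size limit does not constrain the size of the objects in the lake.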

The use of third-party encryption solutions, like the XSOC Cryptosystem for data at rest and XSOC EBP for data in transit, gives organizations more autonomy over the security of their sensitive data in data lakes, in the cloud and on premises. Another advantage of using encryption solutions like those from XSOC CORP is that the encryption provided is FIPS 140-2 certified and quantum-safe, with a minimum 512-bit strength. XSOC’s EBP can be used to securely transmit extremely large data packets across networks at high speed and with low latency.


Securing data lakes means securing the data that is stored in the data lake and that is transmitted in and out of the system. XSOC CORP offers the highest-level data protection that organizations need today: certified solutions that provide hardened encryption protection for all forms of data without impairing the performance of applications. To ensure that your organization’s data can mount the best encryption defense against cyberattacks, contact one of our representatives.

