Swimming through the data lake – key issues for data licensees
Data has the power to transform any business. However, for organisations that license large amounts of third party data, good data management is essential in order to maximise value and avoid the risks and liabilities that arise from misuse of data.
We are seeing data licence audits occur more regularly. Data vendors are starting to realise that auditing licensees’ use of their data can unlock untapped revenue streams. In fact, some data vendors now have formal audit programmes, and licensees are increasingly being forced to unravel their use of a myriad of licensed-in datasets in order to provide usage details to the data vendors.
Compared with a software licence audit, however, an audit of data usage can be far more intrusive. Sometimes performed by a professional services company, such audits can be time consuming and expensive, causing significant business interruption. Further, depending on the terms of the relevant data licence, the licensee can find itself liable for the significant cost of the audit if a certain amount of unlicensed use is discovered, as well as damages for the unlicensed use. In a worst case scenario, a licensee could face the threat of losing access to the data, either because the licensor exercises a contractual right to terminate access, or seeks an injunction to prevent use by the licensee.
Further, the increasing use of open data creates potential risks to proprietary datasets. If open data licensed under an onerous licence is used in the wrong way, it can ‘infect’ the proprietary datasets, thereby triggering the obligation to make derived works open and destroying the commercial viability of the relevant project. In addition, there is the impact of GDPR and the threat of liability for misuse of personal data, both in terms of increased fines from the regulators and private claims through the courts.
In that context, this article provides an overview of the most important issues for organisations to consider if they consume large amounts of third party data. This article also sets the scene for a series of further articles which analyse the issues discussed below in more detail.
Mapping datasets to the relevant contractual arrangements
Third party data can be obtained from a multitude of different sources: publicly available sources, open data, information scraped from websites, and commercially licensed-in data. With the exception of publicly available data, all of those sources will have contractual terms which apply. In particular, data that is scraped from websites, or obtained under copyleft-style open data licences or commercial licences, may well be subject to significant prohibitions which affect how it can be used.
However, many organisations don’t adequately map datasets to the contractual terms that apply to them. The result is a poor understanding of permitted usage, which in turn creates potential risks and liabilities. In big data projects, this sort of mapping exercise is not necessarily simple. With data being scraped by bots (sometimes via third parties) from various websites, each with different terms and conditions, it is difficult to keep track of which website terms apply to particular datasets. For commercially licensed-in data, organisations may have hundreds (if not more) of inbound data licence agreements – many with large volumes of addenda (often with bespoke terms that take precedence over a master data licence) covering different data feeds. It can therefore be difficult to track which contractual documents apply to a given dataset.
However, having this properly mapped out will enable organisations to reduce both the risk of liability for the misuse of data and the amount of management time required from in-house data, legal and compliance teams. Crucially it can also allow products to be developed and brought to market quicker.
Understanding relevant data licence terms
Organisations that consume large amounts of third party data will have contractual obligations under a myriad of commercial and/or open data licences and/or website terms and conditions, many of which will impose restrictions and prohibitions on the licensee. A lack of understanding of those restrictions and prohibitions may result in difficulties in data use permissioning (on which, see below), which can lead to the use of data for unlicensed activities.
Therefore, while mapping datasets to the relevant contracts is helpful, best practice is to also include details of the relevant terms relating to licence scope and prohibitions/restrictions on use. Such an exercise will enable relevant teams within an organisation to quickly establish whether certain use cases are permitted or not, thereby enabling quick, efficient and accurate decision making. It will also facilitate accurate reporting of management information and contractual positions against particular datasets.
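As a rough illustration of this kind of mapping, a dataset record can hold references to the contractual documents that govern it, together with the scope and prohibition terms extracted from each. The sketch below is a minimal example under stated assumptions: the class names, field names, use-case labels and contract references are all hypothetical, and treating a document flagged as taking precedence as simply overriding the master licence is a simplification of how precedence clauses operate in practice.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LicenceTerms:
    """Key terms pulled from one contractual document (hypothetical fields)."""
    contract_ref: str
    permitted_uses: frozenset = frozenset()
    prohibited_uses: frozenset = frozenset()
    takes_precedence: bool = False  # e.g. an addendum with bespoke terms


@dataclass
class DatasetRecord:
    """One dataset mapped to the documents that govern its use."""
    name: str
    source_type: str  # "commercial", "open", "scraped" or "public"
    terms: list = field(default_factory=list)

    def check_use(self, use_case: str) -> str:
        """Return 'permitted', 'prohibited' or 'needs review'."""
        # Consult documents that take precedence before the master licence.
        ordered = sorted(self.terms, key=lambda t: not t.takes_precedence)
        for t in ordered:
            if use_case in t.prohibited_uses:
                return "prohibited"
            if use_case in t.permitted_uses:
                return "permitted"
        # Silence in every document is not treated as permission.
        return "needs review"


# Hypothetical example: a master licence plus one overriding addendum.
master = LicenceTerms(
    "MASTER-1",
    permitted_uses=frozenset({"internal research"}),
    prohibited_uses=frozenset({"redistribution"}),
)
addendum = LicenceTerms(
    "ADDENDUM-3",
    permitted_uses=frozenset({"redistribution"}),
    takes_precedence=True,
)
record = DatasetRecord("example_feed", "commercial", [master, addendum])

for use in ("internal research", "redistribution", "derived data"):
    print(use, "->", record.check_use(use))
```

Returning 'needs review' where no document addresses a use case is a deliberate design choice in this sketch: it routes ambiguous or unaddressed uses to the legal or compliance team rather than assuming they are allowed.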
Data licensees could also greatly benefit from tracking audit clauses so as to programmatically know when audits might be likely and to take pre-emptive action in carrying out pre-audit work and remediating issues before licensors carry out formal audits.
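One way to track audit clauses programmatically is sketched below. It assumes, purely for illustration, that each audit clause can be reduced to a minimum interval between audits and that a fixed preparation window (90 days here) is wanted before the earliest date on which the licensor could next audit. Real audit clauses vary widely, and all names, dates and figures in the example are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class AuditClause:
    """Audit terms recorded against one licence (hypothetical fields)."""
    licence_ref: str
    last_audit: date        # last audit date, or licence start if never audited
    min_interval_days: int  # assumed minimum gap between audits under the clause

    def earliest_next_audit(self) -> date:
        return self.last_audit + timedelta(days=self.min_interval_days)


def pre_audit_worklist(clauses, today, prep_days=90):
    """Licences whose next possible audit falls within `prep_days` of today,
    soonest first, so that pre-audit work can start ahead of the licensor."""
    horizon = today + timedelta(days=prep_days)
    due = [c for c in clauses if c.earliest_next_audit() <= horizon]
    return sorted(due, key=AuditClause.earliest_next_audit)


# Hypothetical licences and dates.
clauses = [
    AuditClause("VENDOR-A", date(2023, 1, 1), 365),
    AuditClause("VENDOR-B", date(2023, 9, 1), 730),
    AuditClause("VENDOR-C", date(2023, 3, 1), 365),
]
worklist = pre_audit_worklist(clauses, today=date(2023, 12, 15))
print([c.licence_ref for c in worklist])
```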
Data licences (particularly in the market data space) often contain ambiguities which can make it difficult for licensees to properly understand the extent of their rights and the scope of prohibitions on use. In part, this results from a lack of clarity across the industry as to the meaning of certain terms which are commonly used in data licences.
By way of illustration, there is often uncertainty relating to terms dealing with usage scope (for example, if the licence states ‘non-commercial use only’ – when does use become ‘commercial’ in nature?). Other areas where disputes can arise are ‘internal’ use and rights to create aggregated or derived data, as those terms are often not defined.
Therefore, in working on data mapping and user permissioning, organisations should take account of any ambiguities in their data licences, particularly those that contain significant restrictions or prohibitions, so that business teams can quickly assess the risks associated with using certain datasets for particular use cases.
Who is using which data sets?
Organisations might be holding vast quantities of data from third party sources, but many do not have any controls in place to track which business units are accessing particular data sources from internal data repositories.
Data is so readily available that it can be all too easy for business teams to access datasets without any audit trail. This leads to a lack of understanding of who is using which datasets within the organisation and (crucially) for what purpose. Any company that has licensed-in large quantities of data should address these issues and implement processes and procedures to track which datasets are being accessed by which teams, how, when and for what purpose.
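A minimal sketch of such an audit trail is set out below. It assumes that access to the internal repository can be routed through a single function that records the dataset, the team, the method of access, the time and the stated purpose. The class and field names are hypothetical, and a production system would persist events rather than hold them in memory.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AccessEvent:
    dataset: str
    team: str
    purpose: str
    method: str           # e.g. "api", "bulk export" (hypothetical labels)
    accessed_at: datetime


class AccessLog:
    """In-memory audit trail of who accessed which dataset, how, when and why."""

    def __init__(self):
        self._events = []

    def record(self, dataset, team, purpose, method, when=None):
        # Refuse to log an access that does not state its purpose.
        if not purpose.strip():
            raise ValueError("a purpose must be stated for every access")
        event = AccessEvent(
            dataset, team, purpose.strip(), method,
            when or datetime.now(timezone.utc),
        )
        self._events.append(event)
        return event

    def usage_by_dataset(self):
        """Map each dataset to the set of (team, purpose) pairs that used it."""
        summary = defaultdict(set)
        for e in self._events:
            summary[e.dataset].add((e.team, e.purpose))
        return dict(summary)


log = AccessLog()
log.record("example_feed", "quant research", "backtesting", "api")
log.record("example_feed", "sales", "client reporting", "bulk export")
print(log.usage_by_dataset())
```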
Rigour around data usage permissioning
Data is so easy to consume. Once it hits an organisation’s central repository, there is a tendency for business teams to get carried away with the development of new products or services and to access data for those purposes. All too often we’re seeing business teams consuming data for use cases that are not checked against the terms of the relevant licences, thereby unintentionally exposing the business to the risk of liability under its data licences.
These risks can be mitigated through the implementation of mandatory processes which require business teams to obtain authorisation for the use of third party datasets by reference to the scope and prohibitions of the relevant licences, and then record such use in a central inventory. These processes do not have to be time consuming or complicated and can be tailored for the types of data that a particular organisation has licensed. The permissioning process can also be tied to data mapping, so that different tiers of permissioning can be implemented for different types of datasets, depending on the restrictions and prohibitions in the relevant licences.
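The tiered approach described above can be sketched as follows. The three tiers, the approver attached to each tier, and the rule that unmapped datasets fall into the strictest tier are all assumptions made for illustration. An organisation would define its own tiers by reference to the restrictions in its licences.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    """Permissioning tiers; the levels and approvers are assumptions."""
    PERMISSIVE = "self-certify"
    RESTRICTED = "compliance sign-off"
    HIGH_RISK = "legal sign-off"


@dataclass(frozen=True)
class UseRequest:
    dataset: str
    team: str
    use_case: str


class UsageInventory:
    """Central record of authorised uses of third party datasets."""

    def __init__(self, dataset_tiers):
        self._tiers = dict(dataset_tiers)
        self._authorised = []

    def required_approval(self, request):
        # Datasets that have not been mapped default to the strictest tier.
        return self._tiers.get(request.dataset, Tier.HIGH_RISK).value

    def authorise(self, request, approved_by):
        needed = self.required_approval(request)
        if approved_by != needed:
            raise PermissionError(f"{request.dataset} requires {needed}")
        self._authorised.append((request, approved_by))

    def is_authorised(self, request):
        return any(r == request for r, _ in self._authorised)


inventory = UsageInventory({
    "open_permissive_set": Tier.PERMISSIVE,
    "vendor_prices": Tier.RESTRICTED,
})
request = UseRequest("vendor_prices", "product", "new pricing tool")
inventory.authorise(request, approved_by="compliance sign-off")
print(inventory.is_authorised(request))
```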
Third party access
One area that is often overlooked by licensees is the issue of access to licensed-in data by third parties who are engaged to provide services which might require use of that data (for example sub-processors, cloud providers and product/service distributors). A common issue we see is that the relevant licences are personal to the licensee and for internal use only (with sub-licensing and sub-contracting prohibited) and/or limited to a set number of users within the licensee’s organisation.
In such a situation, the licensee might be in breach of the data licences if it allows third parties to access or use the data in the course of providing services to the licensee. This is, of course, a tricky issue to navigate, as licensees will not necessarily know who will need access to particular datasets when entering into the relevant licences, and so these issues can be difficult to address at the outset. However, this reinforces the point above that rigour around permissioning of data usage, particularly in relation to aggregation, distribution and use in products/services, is crucial to ensure that organisations are not exposing themselves to potentially significant risks and liabilities through the use of licensed-in data.
To the extent that personal data is contained within the data set that is being used on a particular project, the licensee will need to satisfy itself that any processing (including the use of or further sharing) of personal data is carried out in accordance with the principles and obligations set out in the General Data Protection Regulation and the Data Protection Act 2018.
In particular, the licensee will want assurances that the licensor has:
- a lawful basis for collecting and sharing any personal data contained in the data set; and
- provided all necessary notices to individuals whose personal data is contained in the data set.
Given the potential compliance burden, organisations may wish to consider whether it is appropriate to anonymise any personal data contained in the data set. The benefit is that personal data that has been anonymised is not subject to the GDPR, so anonymisation can limit the compliance risk. However, such anonymisation should be carried out with great care, as any personal data sets which have not been truly anonymised (for example, because an individual can still be identified from the data set) will remain subject to the terms of the GDPR.
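As a very rough screening aid, the sketch below flags combinations of attribute values that are shared by only a few records, since such records are more likely to single out an individual. This is a simple group-size check of the kind used in k-anonymity analysis. The threshold of 5, the column names and the sample rows are assumptions for illustration only, and passing such a check does not by itself establish that a data set is anonymised for GDPR purposes.

```python
from collections import Counter


def small_groups(rows, quasi_identifiers, k=5):
    """Return combinations of quasi-identifier values shared by fewer than k rows.

    `rows` is a list of dicts; `quasi_identifiers` names the columns that,
    taken together, might identify a person (an assumption made by the caller).
    """
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical data: direct identifiers already removed.
rows = (
    [{"postcode_area": "EC1", "age_band": "30-39"}] * 6
    + [{"postcode_area": "EC1", "age_band": "80-89"}] * 1
)
flagged = small_groups(rows, ["postcode_area", "age_band"])
print(flagged)
```

Any combination that is flagged would need further generalisation, suppression or a fuller re-identification assessment before the data set is treated as anonymised.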
Be careful with open data
Open data represents a compelling opportunity for innovation. Organisations can use open data to significantly enhance and augment their existing data sources, develop personalised or contextually-aware features or create new products.
However, while open data can be freely obtained, it does not necessarily follow that it can be freely used by anyone, for any purpose. Like open source software, open data can be licensed under a variety of licences. Many of those licences will be permissive, but some contain more restrictive terms which are equivalent to the ‘copyleft’ terms found in open source software licences. If such open data is used, there will be a risk that any derivative work that is created using the open data would need to be made open to the public and licensed under the restrictive open data licence. Clearly that could compromise an organisation’s proprietary data, as well as the commercial viability of the project that the open data is being used for.
Therefore, when using open data, just as with commercially licensed data, it is important to understand the relevant data licence terms and to have procedures in place to ensure that permissioning of open data use takes account of obligations in restrictive licences.
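To illustrate how restrictive open data terms can spread through a project, the sketch below walks a simple lineage map and lists every derived dataset that has a copyleft-style source somewhere upstream. It makes the conservative assumption that any downstream use carries the obligation. Whether a given output is in fact a derivative work under a particular licence is a legal question that code cannot answer. The dataset names are hypothetical and the lineage map is assumed to contain no cycles.

```python
def datasets_inheriting_share_alike(inputs_of, share_alike_sources):
    """Return derived datasets with a copyleft-style source anywhere upstream.

    `inputs_of` maps each derived dataset to its direct inputs.
    `share_alike_sources` is the set of source datasets under restrictive
    open data licences. The lineage is assumed to be acyclic.
    """
    memo = {}

    def inherits(name):
        if name not in memo:
            memo[name] = name in share_alike_sources or any(
                inherits(parent) for parent in inputs_of.get(name, ())
            )
        return memo[name]

    return {name for name in inputs_of if inherits(name)}


# Hypothetical lineage for one project.
inputs_of = {
    "enriched_customers": ["crm_export", "open_geo_sharealike"],
    "pricing_model_inputs": ["enriched_customers", "vendor_prices"],
    "sales_dashboard": ["crm_export", "vendor_prices"],
}
affected = datasets_inheriting_share_alike(inputs_of, {"open_geo_sharealike"})
print(sorted(affected))
```

A list like this is best treated as a prompt for legal review of the affected datasets, not as a conclusion that they must be made open.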
Use of software tools
Clearly there is a lot to consider for organisations that license large volumes of third party data. Different sources will have different obligations, and licensees need a way of tracking where datasets come from and which contractual terms govern use of the data, as well as policies and procedures to ensure data use is efficiently and accurately permissioned.
This is where software tools can help. Huge amounts of time and cost can be saved by implementing software tools to assist with data mapping and use case permissioning. Licensees can implement systems which contain details of all datasets that an organisation has licensed, details of the relevant contractual documents and the key terms relating to licence scope and usage rights. This would enable business, legal or compliance teams to quickly look up datasets that an organisation wants to use for a particular project, make an informed assessment of the risks associated with using the data in question and take steps to mitigate those risks.
Want to know more?
To discuss any of these issues, navigating data audits and disputes, negotiations with particular data vendors, or software and systems to manage data use and control, please feel free to contact me directly at Jeremy.Harris@Kemplittle.com.