In this post I would like to introduce the topic of GDPR in the Analytics world, before developing it more practically in the next one.
In case you missed it, the GDPR enforcement date is coming: 25 May 2018.
Nothing new, just enforcing it!
Over the last 2 years I have heard about GDPR everywhere. Almost once a week at first, close to once a day later. At every single conference I have been to, there was at least one session related to this topic. Everybody was pointing at May 2018 as the date when Earth would stop spinning. The new General Data Protection Regulation (GDPR) will change the rules of any business related to data.
In the end it’s not really such a tsunami: the regulation has already been in force since May 2016, the difference is that it will be enforced from 25 May 2018.
Although it is a European Union regulation, it will be applied more widely as, in theory, it applies to any company handling (storing or using) data related to people in the European Union. Of course US, Asian or even Swiss companies could say they don’t care, but if they do, they had better avoid doing business in the European Union, and they will be in trouble if they have a local branch in one of the European countries where the regulation is enforced. Last but not least: in the future other countries will probably adopt new regulations with similar rules or inspired by GDPR (as far as I know Switzerland is working on a refreshed version of its own regulations at this time).
What’s the challenge with GDPR?
But what’s GDPR about? What really changes? Tons have been written on this topic: Google it and you will find thousands of results. I’m not a lawyer and I prefer not to go into details (so you can’t blame me if I miss a few points): keeping it really simple, to me it’s about moving the ownership of personal information back to the user instead of the corporation collecting it. Personal information is any information related, directly or indirectly, to an identifiable person in the European Union (so it’s of course names and emails, but also IP addresses etc.).
The key part is that companies can’t use and abuse data as they want anymore. They must be able to justify what data is used where, by whom and for what reason, assuming the user allowed them to do so. The user also has the right to be forgotten: they can ask a company to delete their data, and the company must actually do it (and not just say they did while still using the information internally).
The big new challenge of GDPR is the fines it brings with it: now it’s really a lot of money, with fines of up to 20 million euros or 4% of a company’s global annual revenue, whichever is higher. This is really the argument that makes companies move and look to become GDPR compliant.
GDPR & Analytics
As I said, I have heard a lot about GDPR in the last 2 years. But it was mainly related to databases or ERP and CRM tools. Or even to Machine Learning (ML), as a company must be able to explain how an ML algorithm “took” a decision.
And what about analytics? Maybe I missed something, but over my last 10 years in the business I have often “heard” about corporate analytical platforms like OBIEE or Oracle Analytics Cloud. A single platform accessing and using multiple data sources, merging them to provide end users with a unique, unified view of the truth in the company. And such platforms are often widely used, with more and more employees having access to them.
The result is often that many employees can access a lot of data through an analytical platform. Some are allowed to export it locally (who isn’t aware of Excel being the most used analytical tool in the world?). Some even receive it by email or via shared folders, through automated reports delivering data around.
If you reduce GDPR to a simple “know who can access what, how and why” question, the challenge is obvious to me. It’s actually a lot more difficult to verify GDPR compliance in an analytical platform than in a database. (And I’m not even talking about self-service Analytics here!)
Data Lineage: the answer to GDPR compliance
Thinking about the questions GDPR asks, it’s possible to answer most of them by implementing Data Lineage: being able to follow every single bit of data from a source to the screen of the employee accessing it. On top of that, the security model needs to be “flattened”, resolving all the inheritance aspects at all the levels (users – groups and groups – groups in an LDAP, users – application roles, groups – application roles and application roles – application roles in OBIEE / OAC). Merge these 2 sets of information together and we have a full view of all the accessible pieces of information.
This doesn’t mean employees actually export or use the information, but having access to it is enough to get in trouble with GDPR if one day something happens.
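To make the “flattening” idea a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the user, group and role names, the column names, and the two dictionaries standing in for what would really be extracted from an LDAP and from the OBIEE / OAC security model. It only shows the principle of resolving inheritance and then merging the result with what each role can see.

```python
# Hypothetical direct memberships: principal -> groups / application roles it belongs to.
# In a real setup this would be extracted from the LDAP and the OBIEE / OAC security model.
MEMBER_OF = {
    "alice": {"grp_finance"},
    "bob": {"grp_sales"},
    "grp_finance": {"grp_employees", "role_fin_analyst"},
    "grp_sales": {"grp_employees"},
    "grp_employees": {"role_consumer"},
    "role_fin_analyst": {"role_consumer"},
}

# Hypothetical output of the lineage side: application role -> columns reachable through it.
ROLE_CAN_SEE = {
    "role_consumer": {"sales.revenue"},
    "role_fin_analyst": {"hr.salary", "customers.email"},
}


def flatten(principal, member_of):
    """Return every group and role reachable from a principal (inheritance resolved)."""
    seen = set()
    stack = [principal]
    while stack:
        current = stack.pop()
        for parent in member_of.get(current, ()):
            if parent not in seen:  # the 'seen' set also protects against membership cycles
                seen.add(parent)
                stack.append(parent)
    return seen


def accessible_columns(user, member_of, role_can_see):
    """Merge the flattened security model with the per-role column list."""
    columns = set()
    for principal in flatten(user, member_of):
        columns |= role_can_see.get(principal, set())
    return columns


if __name__ == "__main__":
    for user in ("alice", "bob"):
        print(user, sorted(accessible_columns(user, MEMBER_OF, ROLE_CAN_SEE)))
```

With this made-up data, alice ends up with access to the personal columns through a chain of groups and roles, even though nothing was granted to her directly. That is exactly the kind of indirect access that is hard to see without flattening.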
Data Lineage and Security mean a lot of metadata. Having metadata is a good start but isn’t the end of a GDPR compliance process: storing and analysing this metadata is the next step.
The answer is a graph database!
Graph Database as support
A graph database is quite simple in theory: a set of nodes (aka vertices) and a bunch of edges connecting nodes together. Each node and edge can have an arbitrary list of properties. It’s easy to imagine the power of this kind of solution for storing almost any kind of data set.
A graph actually perfectly matches data lineage and security metadata because an analytical platform is about data flows: bits going from a database, through some calculations to be modified or adapted and finally to the user’s screen. A graph with nodes and “directed” edges pointing to the following nodes can perfectly represent the various flows.
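As a rough illustration of such a directed graph, the following Python sketch stores a tiny lineage as nodes with properties and directed edges, then walks downstream from a source column. It uses plain dictionaries and lists rather than a real graph database, and all node names, properties and edge kinds are hypothetical.

```python
# Hypothetical nodes, each with a free list of properties.
NODES = {
    "db.customers.email": {"type": "physical_column", "personal": True},
    "db.orders.amount": {"type": "physical_column", "personal": False},
    "model.customer.email": {"type": "logical_column"},
    "model.sales.revenue": {"type": "logical_column"},
    "report.mailing_list": {"type": "report"},
    "report.revenue_dashboard": {"type": "report"},
}

# Directed edges (source, target, properties): data flows from source to target.
EDGES = [
    ("db.customers.email", "model.customer.email", {"kind": "mapped_to"}),
    ("db.orders.amount", "model.sales.revenue", {"kind": "aggregated_into"}),
    ("model.customer.email", "report.mailing_list", {"kind": "displayed_in"}),
    ("model.sales.revenue", "report.revenue_dashboard", {"kind": "displayed_in"}),
]


def downstream(start, edges):
    """Return every node that data from 'start' can flow into."""
    reached = set()
    stack = [start]
    while stack:
        current = stack.pop()
        for source, target, _props in edges:
            if source == current and target not in reached:
                reached.add(target)
                stack.append(target)
    return reached


def reports_showing_personal_data(nodes, edges):
    """List the reports fed by any column flagged as personal."""
    reports = set()
    for name, props in nodes.items():
        if props.get("personal"):
            reports |= {n for n in downstream(name, edges) if nodes[n]["type"] == "report"}
    return reports


if __name__ == "__main__":
    print(sorted(reports_showing_personal_data(NODES, EDGES)))
```

Scanning the whole edge list at each step is fine for a sketch like this one; a graph database indexes the edges so that the same traversal stays fast on real volumes of metadata.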
Graph databases also support a whole new set of algorithms which are specifically developed and optimized for graphs, like identifying shortest paths between nodes, making it simple to analyse the metadata of Data Lineage and Security.
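To give an idea of what a shortest path query answers in this context, here is a small breadth-first search in Python over a hypothetical graph that mixes lineage edges (column to report) and flattened security edges (report to role to user). The graph content and the names are invented for the example, and a graph database would provide this kind of algorithm out of the box instead of hand-written code.

```python
from collections import deque

# Hypothetical directed graph: node -> nodes that data (or access) flows to.
GRAPH = {
    "db.customers.email": ["model.customer.email"],
    "model.customer.email": ["report.churn", "report.mailing"],
    "report.churn": ["role_analyst"],
    "report.mailing": ["role_marketing"],
    "role_analyst": ["alice"],
    "role_marketing": ["grp_marketing"],
    "grp_marketing": ["alice", "bob"],
}


def shortest_path(graph, start, goal):
    """Breadth-first search: return the shortest list of nodes from start to goal, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, ()):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None  # no route: this user cannot reach that piece of data


if __name__ == "__main__":
    for user in ("alice", "bob", "carol"):
        print(user, shortest_path(GRAPH, "db.customers.email", user))
```

The returned path is the interesting part for GDPR: it doesn’t only say that a user can reach a personal column, it also says through which model, report and role, which is the “how” of the “who can access what, how and why” question.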
In the next blog post I will explain in a bit more detail what Data Lineage and a graph database mean for GDPR in the Analytics world. Stay tuned …
Second part online: GDPR & Analytics: the Solution