![Woman speaking to an AI chatbot](https://static.wixstatic.com/media/7b2961_19818006551546d18042d552b41647c8~mv2.jpg/v1/fill/w_980,h_550,al_c,q_85,usm_0.66_1.00_0.01,enc_avif,quality_auto/7b2961_19818006551546d18042d552b41647c8~mv2.jpg)
Generative AI (GenAI) has become an unprecedented phenomenon – it can generate seemingly innovative content and increase efficiency by performing tasks like summarization and automation.
GenAI relies heavily on data. Types of data involved with GenAI models include:
Training data: May comprise public information, proprietary information and/or third-party data. Examples include text, images, video or any other form of information, which may contain personal data.
Input data: Used as a trigger or prompt for the GenAI model. This could also consist of text, images or any other forms of information, which may include personal data.
Output data: Responses generated by the model based on its training data and the input prompt.
![](https://static.wixstatic.com/media/7b2961_d8e7db120748436996a0b9cb46c8aa18~mv2.png/v1/fill/w_980,h_545,al_c,q_90,usm_0.66_1.00_0.01,enc_avif,quality_auto/7b2961_d8e7db120748436996a0b9cb46c8aa18~mv2.png)
Given this heavy reliance on data, there are a number of concerns and challenges that GenAI brings. Here are some that we will look at more closely:
Data privacy and security concerns
Bias
Ethical considerations
Difficulty in explainability and interpretability
What could go wrong?
1. Data Privacy and Security Concerns
As a corporation, data privacy and security should be your utmost priority when using GenAI models. Inadequate data governance can erode customers' trust in your organisation and its reliability.
Firstly, AI models are often trained on data scraped from the internet, raising concerns about whether personal data was included in the training data and, if so, whether valid consent was obtained for its processing.
Secondly, confidential information about your company, such as business strategies and customer information, could be shared with GenAI models by your employees. This data may then be used by the system as training data to generate responses for other users, essentially resulting in a data leak. In one incident, a Samsung employee used ChatGPT for code review, resulting in a leak of Samsung’s sensitive internal code. Similarly, it would be a privacy and confidentiality violation if employees input legal documents or medical records into GenAI models to generate summaries.
Lastly, such models are also susceptible to various forms of attack, such as prompt injection and data poisoning. Prompt injection involves manipulating the prompt sent to the GenAI model to influence it into responding in a malicious manner; the same prompt-engineering techniques are used legitimately during model development to guide the model towards more appropriate responses. Data poisoning, by contrast, occurs when threat actors inject malicious data into the training datasets, which in turn degrades the responses generated by the model.
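To make the prompt-injection risk concrete, here is a minimal, hypothetical sketch of a deny-list screen applied to user prompts before they reach a model. The patterns and function name are illustrative assumptions; real defences combine heuristics, trained classifiers and output filtering.

```python
import re

# Hypothetical deny-list of phrases commonly seen in prompt-injection attempts.
INJECTION_PATTERNS = [
    r"ignore (all |any )?(previous|prior) instructions",
    r"disregard (the )?system prompt",
    r"you are now (in )?developer mode",
    r"reveal (your )?(system prompt|hidden instructions)",
]

def looks_like_injection(prompt: str) -> bool:
    """Return True if the prompt matches a known injection pattern."""
    lowered = prompt.lower()
    return any(re.search(p, lowered) for p in INJECTION_PATTERNS)

print(looks_like_injection("Summarise this report for me."))         # False
print(looks_like_injection("Ignore previous instructions and ..."))  # True
```

A screen like this is only a first line of defence: it catches known phrasings, not novel or obfuscated attacks, which is why layered controls matter.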
A 2024 privacy benchmark study by Cisco surveyed 2,600 security and privacy professionals from 12 countries and found that 69% of respondents were concerned that GenAI could hurt their organisation’s legal and intellectual property rights. 68% were also worried that information entered into GenAI tools could be shared publicly, or even with competitors.
With the looming threat of data privacy and security issues, over a quarter (27%) of organisations have already taken steps to ban, at least temporarily, the use of GenAI amongst their employees.
Furthermore, despite being seemingly aware of GenAI’s associated privacy risks, only approximately half of respondents were refraining from entering personal or confidential information into GenAI models.
![](https://static.wixstatic.com/media/7b2961_c5e08d25f7524240ad9d10ad2e91717d~mv2.png/v1/fill/w_980,h_294,al_c,q_85,usm_0.66_1.00_0.01,enc_avif,quality_auto/7b2961_c5e08d25f7524240ad9d10ad2e91717d~mv2.png)
2. Bias
Achieving integrity and data assurance for GenAI can be difficult, as ensuring the quality and accuracy of training data is complex and time-consuming. Data integrity ensures that generated responses are accurate, reliable, complete and valid; in other words, the model's performance depends heavily on the quality of its training data.
As such, these LLMs can inadvertently produce biased or inappropriate content on certain topics, reflecting stereotypes present in the datasets used to train them. Because GenAI relies so heavily on data, biased data inevitably leads to biased models: if the training data is skewed or incomplete, the models will perpetuate those biases and generate inaccurate outputs.
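The skew described above can be surfaced with a very simple check. Below is a hedged sketch, using toy data and hypothetical group labels, that measures each group's share of a training set and flags under-representation against a chosen threshold.

```python
from collections import Counter

# Toy training records: (text, demographic_group) pairs — hypothetical data.
records = [
    ("sample a", "group_1"), ("sample b", "group_1"),
    ("sample c", "group_1"), ("sample d", "group_2"),
]

def representation_ratio(records):
    """Share of the dataset contributed by each group."""
    counts = Counter(group for _, group in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

ratios = representation_ratio(records)
print(ratios)  # group_1 dominates: {'group_1': 0.75, 'group_2': 0.25}

# Flag groups that fall below an illustrative representation threshold.
underrepresented = [g for g, r in ratios.items() if r < 0.4]
print(underrepresented)  # ['group_2']
```

Representation counts are only one facet of bias; real audits also examine label quality, annotator disagreement and downstream outcome disparities.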
3. Ethical Considerations
Another common issue is the creation of realistic but false content, which may be exploited by users and lead to negative publicity surrounding that topic. For instance, deepfakes are used by malicious creators to spread misinformation, especially in the realm of politics.
In the healthcare industry, responses generated by GenAI models may also be factually inaccurate, which may result in incorrect diagnoses and potentially have severe implications on the patient. If the patient’s personal data was input to obtain a response from the GenAI model, it would also be subject to the applicable privacy laws.
4. Difficulty in Explainability and Interpretability
How does a GenAI model decide what response to produce, or what data to use to generate it? The opacity of the algorithms behind GenAI models can make their outputs difficult to explain or interpret, calling their reasonableness into question. Being able to explain why a decision was made supports the verifiability of generated content and helps address issues prevalent in GenAI models, such as hallucinations.
While explainability and interpretability are sometimes used interchangeably, they are different. Interpretability focuses on the inner mechanisms of a model, to understand exactly how and why the model generates its responses, while explainability is about describing the model's behaviour in human terms.
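As a toy illustration of explainability, one common technique is perturbation-based attribution: drop each input token in turn and measure how much the model's output changes. The sketch below uses a hypothetical stand-in scoring function, not a real GenAI model, so the word lists and scores are purely illustrative.

```python
# Leave-one-out attribution: the more the score changes when a token is
# removed, the more that token "explains" the output.
POSITIVE = {"excellent", "good", "reliable"}
NEGATIVE = {"poor", "biased", "unsafe"}

def toy_score(tokens):
    """Stand-in for a model's output score (not a real GenAI model)."""
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

def attributions(tokens):
    base = toy_score(tokens)
    # Score the input with each token removed and compare to the baseline.
    return {t: base - toy_score([u for u in tokens if u != t]) for t in tokens}

print(attributions(["the", "model", "is", "reliable", "but", "biased"]))
# 'reliable' → +1, 'biased' → -1, all other tokens → 0
```

Against a real model the same idea applies, with the toy scorer replaced by the model's actual output probability or quality metric.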
Are there regulations governing the safe use of GenAI?
The rapid adoption of GenAI systems worldwide has brought about several regulations and guidelines in countries such as the USA, China, and Japan. The growing body of guidance governing the safe use of GenAI reflects widespread concern about the risks associated with GenAI models.
Countries with AI guidelines/regulations
(non-exhaustive)
![](https://static.wixstatic.com/media/7b2961_c89d2b4d997f4c33a0cce4362341b5fe~mv2.png/v1/fill/w_980,h_535,al_c,q_90,usm_0.66_1.00_0.01,enc_avif,quality_auto/7b2961_c89d2b4d997f4c33a0cce4362341b5fe~mv2.png)
Across these regulations and guidelines imposed by various governments, common themes centre on the core principles of AI: human rights, sustainability, transparency, and strong risk management. The regulations also promote a risk-based approach, to be applied in conjunction with cybersecurity, data privacy and intellectual property considerations.
What can you do to mitigate risk?
![](https://static.wixstatic.com/media/7b2961_fb699beeb31c472c8144c7fd7374ca26~mv2.png/v1/fill/w_980,h_154,al_c,q_85,usm_0.66_1.00_0.01,enc_avif,quality_auto/7b2961_fb699beeb31c472c8144c7fd7374ca26~mv2.png)
1. Perform a Data Protection Impact Assessment (DPIA)
Identifying potential risks is a fundamental part of protecting data. An essential portion of this is performing DPIAs to assess the impact of a certain process or system implementation with regards to protecting the data held by your organisation.
Inventory all AI algorithms; software discovery tools can help automate some of the manual work required
As with any new personal data processing activity, technology or system, start with a DPIA when adopting or modifying GenAI
Conduct a DPIA for all AI algorithms to be used
Assess risks using risk frameworks such as the NIST AI Risk Management Framework
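The inventory and DPIA-tracking steps above can be sketched as a minimal risk register. All names, fields and entries here are illustrative assumptions, loosely inspired by the NIST AI RMF's map/measure/manage functions.

```python
from dataclasses import dataclass, field

# Hypothetical minimal register entry for an inventoried AI system.
@dataclass
class AISystemEntry:
    name: str
    owner: str
    processes_personal_data: bool
    dpia_completed: bool
    risks: list = field(default_factory=list)

inventory = [
    AISystemEntry("support-chatbot", "CX team", True, False,
                  risks=["prompt injection", "PII in transcripts"]),
    AISystemEntry("doc-summariser", "Legal ops", True, True,
                  risks=["confidential data leakage"]),
]

# Flag systems that process personal data but have no DPIA on file.
needs_dpia = [s.name for s in inventory
              if s.processes_personal_data and not s.dpia_completed]
print(needs_dpia)  # ['support-chatbot']
```

Even a lightweight register like this makes gaps visible: every system processing personal data without a completed DPIA becomes an explicit action item.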
2. Manage risks
Upon identifying the possible risks associated with your GenAI model, risk management measures must be put in place.
Establish AI governance
Apply Privacy-by-Design and Security-by-Design principles in your GenAI onboarding, irrespective of whether you are licensing, integrating or building your own GenAI
Establish ethical standards and a policy framework, which clearly outlines Do’s and Don’ts and acceptable use of GenAI for your business as usual (BAU) processes.
Implement controls – DLP, data anonymisation where possible, and access limitation
Protect your training data
Conduct staff training in relation to data privacy and protection while using GenAI
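As one concrete example of the DLP and anonymisation controls listed above, a simple redaction pass can strip obvious identifiers from prompts before they leave the organisation. This regex-based sketch is illustrative only (the patterns and labels are assumptions); production DLP tooling uses far richer detection such as named-entity recognition and checksum validation.

```python
import re

# Hypothetical redaction patterns applied before a prompt is sent to a model.
PII_PATTERNS = {
    "EMAIL": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "PHONE": r"\+?\d[\d\s-]{7,}\d",
    "NRIC": r"\b[STFG]\d{7}[A-Z]\b",  # Singapore NRIC-style identifier
}

def redact(text: str) -> str:
    """Replace detected identifiers with placeholder labels."""
    for label, pattern in PII_PATTERNS.items():
        text = re.sub(pattern, f"[{label}]", text)
    return text

prompt = "Summarise the complaint from jane.doe@example.com, NRIC S1234567D."
print(redact(prompt))
# Summarise the complaint from [EMAIL], NRIC [NRIC].
```

Redacting at the boundary means that even if the provider retains prompts as training data, the personal identifiers never leave your environment.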
3. Validate your GenAI models
Validating your GenAI model is also important to ensure its accuracy and effectiveness. Doing so helps the model produce better predictions and more precise outputs.
Adopt validation tools and techniques to confirm that algorithms are:
Performing as intended; and
Producing accurate, fair and unbiased outcomes
Closely monitor changes to the algorithm’s decision framework
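The validation checks above can start as small as the following sketch: overall accuracy plus a rough demographic-parity comparison of positive-outcome rates per group. The evaluation data and group labels are hypothetical.

```python
from collections import defaultdict

# (group, prediction, ground_truth) triples from a hypothetical evaluation set.
results = [
    ("group_1", 1, 1), ("group_1", 1, 1), ("group_1", 0, 0),
    ("group_2", 0, 1), ("group_2", 0, 0), ("group_2", 1, 1),
]

# Overall accuracy: fraction of predictions matching the ground truth.
accuracy = sum(p == t for _, p, t in results) / len(results)

# Positive-outcome rate per group, a rough demographic-parity check.
positives = defaultdict(list)
for group, pred, _ in results:
    positives[group].append(pred)
rates = {g: sum(v) / len(v) for g, v in positives.items()}

print(f"accuracy={accuracy:.2f}")  # accuracy=0.83
print(rates)  # group_1's positive rate is roughly double group_2's
```

A gap in positive rates between groups, as in this toy data, is the kind of signal that should trigger a closer look at the training data and decision thresholds.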
4. Conduct regular independent audits
To increase transparency and accountability while still utilising GenAI models for beneficial purposes, it is also crucial to conduct independent audits on a regular basis. Embed these audits in your annual schedules and obtain the necessary assurances over your GenAI models.
Independent audits can be conducted in the following areas:
Design
Ethical considerations
Data security
Data privacy
DLP
Conclusion
While there are risks to using GenAI from a data privacy and security standpoint, these risks can be managed by leveraging your typical privacy risk management techniques.
Start with a DPIA, establish governance over your GenAI models and monitor them continuously. Privacy-by-Design and Security-by-Design frameworks can serve as guiding principles, applied throughout the technology adoption and development process. While this may seem easy to fold into your existing risk management, be mindful of the additional design and technical considerations that GenAI introduces.