Bridging the Gap Between Data and Artificial Intelligence: The Power of Data Governance
Laurent Philonenko, Hassan Lâasri
May 31, 2024
Executive Summary
Effective data governance is critical for deploying generative artificial intelligence (AI). Without proper data governance, AI models may produce inaccurate, biased, or harmful outputs, particularly when training data comes from diverse internal and external sources. Data governance helps organizations provide accurate results, establish ethical AI development guidelines, prevent misuse, protect sensitive information, and build stakeholder trust. Companies should adopt agile data governance practices that can scale and adapt to support their AI initiatives, including flexible data management, automated quality monitoring, and maintaining accurate data lineage. Balancing data governance with innovation requires a strategic approach, establishing clear frameworks while fostering a culture of experimentation. Effective data governance enables AI models to deliver more accurate, reliable, and trustworthy predictions. It achieves this by ensuring high-quality, dependable data. As the AI landscape evolves, organizations must stay informed about emerging data governance trends and best practices to keep their frameworks relevant and effective in supporting their AI initiatives.
Data modernization is essential for business
A May 2024 Thoughtworks/MIT study[1] showed that companies are modernizing their data, the top two reasons being better decision making and support for AI models. The study also points to major challenges: only 39% of companies have a data strategy aligned with key business objectives, data quality and timeliness of delivery are the areas most in need of improvement, and security and compliance are impediments to modernization, cited by 44% of the study respondents. This article focuses on one critical aspect of data modernization: data governance.
Data Governance: The Art and Science of Turning Data into an Asset
Implementing a comprehensive AI strategy involves more than just compiling data in a data platform because data comes in multiple forms, each with its own characteristics and value.
Data Varies by Type:
Businesses typically deal with three types of data: ‘owned’ data, ‘desired’ data, and ‘required’ data. ‘Owned’ data includes information companies possess about their customers and operations. ‘Desired’ data is the information gathered about the market, potential customers, and competitors. ‘Required’ data encompasses information companies are obligated to report due to regulatory laws.
Data Varies by Value:
For retail businesses, data is key for marketing, customer relationship management, campaign optimization, and demand planning. In the luxury sector, the most important data pertains to stores and supply chain management, as these businesses typically do not engage in mass campaigns and know their customers already. For fund managers, crucial data includes information about companies, the economy, the stock market, and regulations, which is used to develop sophisticated investment products.
Data Also Varies in Age, Structure, Format, Quantity, Quality, and Usefulness:
What is important to one division might not be relevant to another, even within the same organization. For example, in the luxury sector, a product like a dress or a bag may be viewed through different attributes depending on the database where it is stored. It is not uncommon for an item to have hundreds of attributes owned and managed by different applications.
Successful companies not only have an AI strategy and a data platform to execute this strategy but also implement strong data governance. This ensures that the data they value the most is properly organized, compliant with regulations, and ready for use in generating business value. Data governance, distinct from data management or data quality management, encompasses the organization, processes, and tools necessary to prepare data for activation. This enhances customer experience, optimizes operations, or helps develop new business models. Without proper data governance, AI cannot fulfill its promise of transforming data into an asset to value or monetize.
The Criticality of Data Governance in the Context of AI
In the context of AI, data governance becomes more important because the model may be trained on data and documents from inside and outside the organization that must be regularly checked and updated. This is due to the nature of AI models, especially LLMs (large language models[2]), which learn to create new content by analyzing existing data. The data used for training can include text, images, audio, and video, from internal and public sources, making it crucial to have clear policies and procedures for data management, accuracy, security, and protection. Without proper data governance, AI models may produce inaccurate, biased, or even harmful outputs. The recent goofs by Google’s Gemini, which suggested adding glue to pizza or eating rocks, may seem comical, but they reveal what happens when data governance is poor or absent. In this case, a person immediately understands that the LLM’s output is wrong, but in many other cases, the distinction between correct and harmful is not necessarily clear to the general public.
Effective data governance is essential for ensuring the ethical use of AI. It helps organizations establish clear guidelines for data usage, ensuring that AI models are trained and deployed in a manner that respects privacy, avoids discrimination, and promotes fairness. For example, mortgage application approval scoring models have been criticized for their biases and lower accuracy for minorities[3]. It is understandable that such issues erode trust, or at least do not contribute to building it.
Organizations should adopt agile data governance practices that can quickly scale and adapt to changing business needs. This includes implementing flexible data management processes, leveraging automation for data quality monitoring, and regularly reviewing and updating data governance practices. Additionally, organizations should make data governance an integral part of their culture, encouraging employees to participate actively in data management and ensuring that data governance is integrated into all aspects of AI development and deployment.
Credit: The Rise of the Data Marketplace: Data as a Service by Dave Wells, Eckerson Group
Data lineage is a critical component of AI data governance. It involves tracking data’s origin, movement, and transformation throughout its lifecycle. This means knowing where the data comes from, how it changes, and where it goes, providing visibility into how data is used and transformed within AI models. Organizations can ensure data quality, traceability, and compliance by maintaining accurate data lineage, enabling them to identify and address issues quickly and effectively.
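To make the idea of lineage concrete, the following is a minimal sketch, not a description of any particular lineage tool. It assumes lineage is captured as an append-only log of events, each naming an output dataset, the operation applied, and the input datasets. The class names (LineageEvent, LineageLog) and the dataset names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LineageEvent:
    """One step in a dataset's history: what was done, from which inputs."""
    dataset: str
    operation: str
    inputs: list[str]
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class LineageLog:
    """Append-only log of lineage events, queried by dataset name."""

    def __init__(self) -> None:
        self.events: list[LineageEvent] = []

    def record(self, dataset: str, operation: str, inputs: list[str]) -> None:
        self.events.append(LineageEvent(dataset, operation, inputs))

    def upstream(self, dataset: str) -> set[str]:
        """Return every dataset the given dataset was derived from."""
        seen: set[str] = set()
        stack = [dataset]
        while stack:
            current = stack.pop()
            for event in self.events:
                if event.dataset == current:
                    for source in event.inputs:
                        if source not in seen:
                            seen.add(source)
                            stack.append(source)
        return seen


# Hypothetical pipeline: raw CRM data is deduplicated, then joined
# with an external market feed to build a training set.
log = LineageLog()
log.record("crm_clean", "deduplicate", ["crm_raw"])
log.record("training_set", "join", ["crm_clean", "market_feed"])
print(sorted(log.upstream("training_set")))
```

With a log like this, a question such as "which sources fed this training set?" becomes a simple query, which is the kind of traceability the paragraph above describes. Dedicated lineage and metadata tools typically capture far more detail, such as column-level transformations.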
Balancing Data Governance and Innovation in AI
Balancing data governance with innovation and agility requires a strategic approach to AI development. This can be achieved by implementing agile data governance practices, such as iterative development and continuous improvement, and leveraging technologies that enable rapid prototyping and deployment of AI models. By striking the right balance between data governance and innovation, organizations can unlock the full potential of AI while ensuring data quality, reliability, and compliance.
Data governance is often regarded as challenging and time-consuming, but it is critical. It ensures that the data used for training and deploying AI models is high-quality, accurate, and unbiased, leading to more accurate, reliable, and trustworthy predictions.
Effective data governance requires the implementation of clear policies and processes. This includes establishing data ownership and accountability, defining data quality standards, implementing data security and access controls, and regularly reviewing and updating data governance practices. Organizations should also invest in employee data governance training to ensure consistent application of these policies across the enterprise.
In addition to ensuring data quality and reliability, data governance for AI must also prioritize data privacy and compliance. Organizations should implement robust data protection measures, such as anonymization and encryption, to safeguard sensitive information. To avoid legal and reputational risks, they must also ensure that their data governance practices align with relevant data privacy regulations, such as GDPR and CCPA.
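As an illustration of one protection measure, the sketch below replaces direct identifiers with a keyed hash before data is shared with an AI pipeline. Strictly speaking this is pseudonymization rather than full anonymization, since anyone holding the key can recompute the mapping. The function names, the field names, and the placeholder key are assumptions made for this example.

```python
import hashlib
import hmac

# Assumption: in practice the key would come from a secrets manager,
# never from source code. This value is a placeholder for the sketch.
SECRET_KEY = b"replace-with-a-managed-secret"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a keyed SHA-256 digest (HMAC)."""
    normalized = value.strip().lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def protect_record(record: dict, sensitive_fields: set[str]) -> dict:
    """Return a copy of the record with sensitive fields pseudonymized."""
    return {
        name: pseudonymize(str(value)) if name in sensitive_fields else value
        for name, value in record.items()
    }


# Hypothetical customer record: only the email is treated as sensitive.
customer = {"email": "Jane.Doe@example.com", "segment": "premium"}
safe = protect_record(customer, {"email"})
print(safe["segment"], len(safe["email"]))
```

Because the same input always yields the same digest, records can still be joined across systems without exposing the underlying identifier. Whether this is sufficient for a given regulation is a legal question that the sketch does not settle.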
AI models require vast amounts of data to learn and make accurate predictions. Without proper data governance, the data used for training machines can be incomplete, inaccurate, or biased, leading to flawed outputs and poor decision-making. Effective data governance practices, such as data cleansing, integration, management, and security, help ensure that AI models are based on reliable and trustworthy data.
Overcoming the Challenges of Data Governance for AI
Implementing effective data governance for AI has its challenges. Organizations may struggle to align data governance practices across siloed departments or business units. They also need buy-in from employees, who must be willing to adopt new data management processes. Additionally, the rapidly evolving nature of AI technology can make it difficult to keep data governance practices up to date, requiring ongoing review and adaptation.
To ensure the effectiveness of their data governance initiatives, leaders should establish clear metrics and Key Performance Indicators (KPIs) to measure success. This may include tracking improvements in data quality, model accuracy, operational efficiency, and cost savings. By demonstrating the tangible, quantifiable benefits of effective data governance, businesses can justify the investment and secure ongoing support for these initiatives.
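As one way to make such KPIs measurable, the sketch below computes two simple data quality indicators: completeness (the share of required fields that are filled in) and duplicate rate (the share of records that repeat an earlier key). These two metrics, the function names, and the sample records are illustrative assumptions; an actual program would choose indicators that match its own objectives.

```python
def completeness(records: list[dict], required: list[str]) -> float:
    """Share of required fields that are filled in across all records."""
    total = len(records) * len(required)
    if total == 0:
        return 1.0
    filled = sum(
        1
        for record in records
        for name in required
        if record.get(name) not in (None, "")
    )
    return filled / total


def duplicate_rate(records: list[dict], key: str) -> float:
    """Share of records whose key value already appeared earlier."""
    if not records:
        return 0.0
    seen = set()
    duplicates = 0
    for record in records:
        value = record.get(key)
        if value in seen:
            duplicates += 1
        seen.add(value)
    return duplicates / len(records)


# Hypothetical sample: one empty email, one missing country, one repeated id.
sample = [
    {"id": 1, "email": "a@example.com", "country": "FR"},
    {"id": 2, "email": "", "country": "FR"},
    {"id": 2, "email": "b@example.com", "country": None},
    {"id": 3, "email": "c@example.com", "country": "US"},
]
print(completeness(sample, ["email", "country"]))  # 6 of 8 fields filled: 0.75
print(duplicate_rate(sample, "id"))                # 1 of 4 records repeated: 0.25
```

Tracked over time, figures like these give a numerical baseline against which improvements can be reported.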
As AI evolves, organizations must stay informed about emerging trends and best practices in data governance. This may include using automated data quality monitoring tools such as Collibra, integrating data governance with MLOps tools such as Databricks, and adopting federated learning[4] and differential privacy techniques to protect sensitive data[5]. By staying ahead of these developments, organizations can ensure that their data governance frameworks remain relevant and effective in supporting their AI initiatives.
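To give a feel for how differential privacy works, the sketch below applies the Laplace mechanism to a simple count. This is a simpler setting than the differentially private training mentioned above, and it is offered only as an illustration of the noise-adding principle. The function names and the sample data are assumptions, and Python's general-purpose random module is used for readability; it is not suitable for production privacy guarantees.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(values: list[bool], epsilon: float, rng: random.Random) -> float:
    """Noisy count of True values; a count query has sensitivity 1."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    sensitivity = 1.0
    return sum(values) + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
opted_in = [True, False, True, True, False, True]  # hypothetical flags
print(private_count(opted_in, epsilon=0.5, rng=rng))
```

A smaller epsilon adds more noise and gives stronger privacy at the cost of accuracy, so choosing it is a governance decision as much as a technical one.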
Data Cleansing and Storage Strategies for AI-Powered Systems
Data cleansing is a crucial part of data governance. It involves identifying and correcting data inconsistencies, errors, and inaccuracies. Data cleansing ensures the data is accurate, reliable, and consistent, making it suitable for analysis and decision-making. Typical issues when preparing or collecting data include data duplication, missing values, spelling errors, inconsistent formatting, and incomplete records. Left unresolved, these issues can lead to biased analysis, incorrect predictions, and inaccurate decision-making. Cleaning the data resolves them and makes the data ready for informed decisions.
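The sketch below shows how three of these issues (inconsistent formatting, incomplete records, and duplicates) might be handled in a single pass. The field names, the choice of email as the deduplication key, and the title-casing of names are simplifying assumptions for illustration; real cleansing rules depend on the data and are usually agreed with its owners.

```python
def clean_records(records: list[dict]) -> list[dict]:
    """Normalize formatting, drop incomplete rows, and remove duplicates."""
    cleaned = []
    seen_emails = set()
    for record in records:
        # Collapse repeated whitespace and apply a simple title case.
        name = " ".join(str(record.get("name") or "").split()).title()
        email = str(record.get("email") or "").strip().lower()
        if not name or not email:
            continue  # incomplete record: skipped here, could be quarantined
        if email in seen_emails:
            continue  # duplicate once formatting has been normalized
        seen_emails.add(email)
        cleaned.append({"name": name, "email": email})
    return cleaned


# Hypothetical input: a formatting variant, a duplicate, and a missing value.
raw = [
    {"name": "  alice   martin ", "email": "Alice@Example.com"},
    {"name": "Alice Martin", "email": "alice@example.com "},
    {"name": "Bob Durand", "email": None},
    {"name": "chloé petit", "email": "chloe@example.com"},
]
print(clean_records(raw))
```

Note that the second record is only recognized as a duplicate because formatting is normalized first, which is why the order of cleansing steps matters.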
A Step-by-Step Plan for Implementing Data Governance in the Context of AI
Data governance is not a project with an end but a program that keeps running as long as data changes due to the evolution of the business, data strategy changes, and new regulations. Despite the data variability described earlier (type, value, age, structure, format, quantity, quality, and usefulness), implementing data governance always goes through the following steps.
Step 1: Appoint a Data Governance Team
Appoint a data governance team that includes representatives from IT, data, business, and legal departments. The team should be responsible for implementing and enforcing the data governance framework, and should have the authority and resources to make decisions and take action.
Step 2: Define the Scope and Objectives
Define the scope of the data governance initiative, including the types of data, AI models, and business processes that will be covered. Identify the key objectives and success metrics for the initiative, such as improving data quality, reducing bias, or enhancing customer experience.
Step 3: Establish a Data Governance Framework
Develop a data governance framework that outlines the roles, responsibilities, policies, and procedures for managing data in AI initiatives. The framework should cover data collection, cleansing, management, security, and privacy, and should align with relevant regulations and standards.
Step 4: Implement Agile Data Governance Practices
Adopt agile data governance practices, such as iterative development, continuous improvement, and automated quality monitoring, to ensure that the data governance initiative is flexible, scalable, and responsive to changing business needs.
Step 5: Develop a Culture of Data Governance
Foster a culture of data governance across the organization, encouraging employees to participate actively in data management and ensuring that data governance is integrated into all aspects of AI development and deployment.
Step 6: Implement Data Lineage and Metadata Management
Implement data lineage and metadata management tools to track the origin, movement, and transformation of data in AI models, and to provide visibility into how data is used and transformed.
Step 7: Monitor and Report on Data Governance Performance
Monitor and report on data governance performance regularly, using the success metrics and KPIs defined in Step 2. The reports should be shared with the data governance team, senior management, and other stakeholders to ensure transparency, accountability, and continuous improvement.
Step 8: Stay Informed about Emerging Trends and Best Practices
Stay informed about emerging trends and best practices in data governance, such as DataOps, federated learning, and differential privacy, to ensure that the data governance framework remains relevant, effective, and up to date.
Use Case: Data Governance in an Insurance Company
An insurance company with three major brands and 11.5 million members was undergoing a digital transformation to improve their customer experience and internal processes. They implemented Salesforce for marketing, sales, and customer relationship management, and integrated advanced data analytics technologies to enhance their performance. However, they faced several challenges, including heterogeneous customer data and inconsistent treatment across entities, a lack of a comprehensive view of customers due to different data silos, and inadequate personalization of direct marketing and customer relationship management.
To address these challenges, the company’s management decided to implement a data governance program. The program’s objective was to integrate and harmonize data from CRMs across brands and subsidiaries, manage data quality, including validity, veracity, and completeness, and ensure data security and compliance with GDPR. The program also aimed to create a repository of all acquired data for easy access and use.
The program established a Data Office with five data managers and 160 data owners, stewards, and architects. They created four work streams for ingestion, integration, and harmonization of data, tracking data quality, complying with GDPR, and referencing external data. A weekly operational committee and a monthly steering committee were set up to ensure alignment and track progress. The company also conducted workshops to address data harmonization, quality, regulatory, and security aspects. A community portal was created for search, sharing, and collaboration between teams and departments.
The data governance program resulted in the integration and harmonization of contract data on 11.5 million customers. The company established regular strategic and operational committees for data management strategy alignment and progress tracking. They also implemented a process for continuous data quality management. The collaborative portal for data science and data governance teams improved data discovery, sharing, and collaboration.
The company’s digital transformation was a success, and they were able to improve their customer experience and internal processes. The data governance program was a critical component of this success, and it enabled the company to fully exploit their data and gain a competitive edge in the market. The program’s implementation was a professional and well-executed example of how data governance can help organizations overcome their data-related challenges and achieve their goals.
About The Authors
Laurent Philonenko is a managing partner at Deeptech Group, an AI advisory firm. As CEO and CTO, he has led large organizations and developed and implemented customer experience applications for small and large businesses. Deeptech Group’s activities include advising startups on strategies, and enterprises on implementing AI at scale. You can check out Laurent’s newsletter[6]. You can also reach him on LinkedIn[7].
Hassan Lâasri is an expert consultant and interim executive specializing in data and AI transformation, including regulations such as the European CSRD mandate and AI Act. His work involves collaborating with startups, growing ventures, and established firms to execute strategic initiatives, including audits, benchmarks, complex projects, and go-to-market strategies. You can check out Hassan’s blog[8]. You can also reach Hassan on LinkedIn[9].
References
[1] https://www.thoughtworks.com/en-us/insights/reports/modernizing-data-with-strategic-purpose
[2] For a good explanation of how OpenAI, Gemini, Claude, Llama, Mistral and others work, see https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/
[4] Federated learning is a distributed machine learning approach that enables multiple devices to collaboratively train a shared model without exchanging their local data.
[5] Differential privacy training is a technique that adds noise to the training process to protect the privacy of individual data points.
[6] https://substack.com/@laurentphilonenko
[7] https://www.linkedin.com/in/laurent-philonenko-3bab5/