Shadow Systems: The Data Infrastructure Nobody Wants to Talk About
This is the second post in our learning series on Modern Data Governance, leading up to a free webinar on April 2nd. Click here for webinar registration, and here to read the first post on data standards.
Over the last two decades (yikes), I’ve accumulated a lot of knowledge about how institutional data environments actually work.
A lot of people talk about data silos in higher education, but far fewer spend time thinking about the risks - and opportunities - of shadow systems.
Shadow systems are the tools and data environments that exist outside the officially supported institutional systems but still play an important role in how work gets done.
They can take many forms:
- Small databases built by individual departments
- Collections of spreadsheets used to track operational processes (often called “spreadmarts”)
- Shared folders containing extracts from enterprise systems
- Email chains that circulate reports and datasets
- Personal scripts or small applications built by analysts
- In some cases, even full data marts maintained by decentralized units
While some IT Services teams try their best to keep on top of shadow systems, the target is often moving. In many cases, beyond being able to log that a query was run against a central database, IT Services may have very little visibility into what happens next. The moment data leaves a central system, it often begins a new lifecycle: copied, transformed, filtered, and recombined in ways that are difficult to track.
Why Shadow Systems Exist
Before treating shadow systems purely as a problem, it’s worth acknowledging something important: shadow systems exist for a reason.
Institutional systems are not always designed to support the operational needs of very different stakeholders across a university. Shadow systems can offer speed, flexibility, and local context that centralized systems struggle to provide. Departments may need to capture additional fields, apply nuanced business rules, or respond quickly to new operational questions.
The users of these systems will often fight tooth and nail to keep them around, even when IT offers to port functionality into a central system. In many ways, shadow systems are a signal that people are trying to solve real problems with the tools available to them. But they also introduce risks.
Challenges Created by Shadow Systems
Institutional Data and Security Risk
If it isn’t clear how and where data is moving, there is always a chance that it ends up exposed to the wrong actors. We often see this with email attachments or shared folders, but the issue can also arise in less obvious places such as personal drives or unmanaged cloud storage, departmental databases running on aging infrastructure, or extracts stored indefinitely after their intended use. In highly regulated areas such as student, employee, financial, or research data, this can create significant institutional risk.
Data Quality Issues
Shadow systems can introduce data quality challenges in several ways. A shadow system may not be aware of changes made in central systems. Data extracts may become stale over time. Local merges and transformations may introduce subtle errors.
One common example is when two teams build reports using different spreadsheet extracts and slightly different definitions. Both may be internally consistent, but the numbers no longer match across the institution. Over time, analysts can end up spending more time reconciling reports than producing insights.
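To make the reconciliation problem concrete, here is a minimal sketch. The data and the two "active student" definitions are invented for illustration; the point is that each extract is internally consistent while the headline numbers diverge.

```python
# Hypothetical illustration: two units pull students from the same source
# but apply slightly different local definitions in their extracts.
students = [
    {"id": 1, "status": "registered"},
    {"id": 2, "status": "registered"},
    {"id": 3, "status": "on_leave"},
    {"id": 4, "status": "withdrawn"},
]

# Unit A's spreadmart: only currently registered students count as "active".
active_a = [s for s in students if s["status"] == "registered"]

# Unit B's spreadmart: students on approved leave also count as "active".
active_b = [s for s in students if s["status"] in ("registered", "on_leave")]

# Both extracts are correct by their own definition, yet the institution-wide
# numbers no longer match.
print(len(active_a))  # 2
print(len(active_b))  # 3
```

Neither unit is wrong; they are answering subtly different questions. Without a shared definition (or visibility into each other's extracts), the discrepancy surfaces only when the two reports land on the same desk.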
Data Governance Challenges
Shadow systems are rarely documented centrally. As a result, it may not be obvious who owns a dataset, business definitions may diverge across units, and institutional data may be duplicated across dozens of small environments. When this happens, governance becomes reactive rather than proactive. Institutions spend time chasing inconsistencies rather than improving how data supports decision-making.
Seeing the System That Actually Exists
One of the most interesting things about shadow systems is that they reveal something important: the official architecture diagram of institutional data is often very different from the real one.
Official diagrams typically show something like this:
System of Record → Data Warehouse → Reports
In practice, the environment often looks more like this:
System of Record → Warehouse → Extract → Spreadsheet → Local Database → Report → Email → Another Spreadsheet
Understanding this real ecosystem of data flows is critical for improving institutional data governance.
Illuminating Shadow Systems
One way institutions can begin to understand these environments is by mapping data lineage, extending the map into the shadow systems themselves. Data lineage tools help visualize how data moves between systems, transformations, and reports.
They allow institutions to answer questions such as:
- Where did this dataset originate?
- What transformations were applied to it?
- Which reports or analyses depend on it?
- Who is using it today?
This visibility can reveal both risks and opportunities. For example, institutions may identify where sensitive data is being replicated unnecessarily, discover widely used datasets that deserve better institutional support, or better understand the impact of making changes to data that downstream shadow systems rely upon.
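At its core, a lineage map is a directed graph of "data flows from X to Y" edges that you can walk to answer impact questions. The sketch below is a toy version with invented node names; a real lineage tool would harvest these edges from query logs, ETL jobs, and report definitions rather than hard-coding them.

```python
from collections import deque

# Hypothetical lineage edges: each key flows into the systems listed.
flows = {
    "student_system": ["warehouse"],
    "warehouse": ["enrolment_extract.xlsx"],
    "enrolment_extract.xlsx": ["dept_database", "retention_report"],
    "dept_database": ["capacity_report"],
}

def downstream(node):
    """Everything that ultimately depends on `node` (breadth-first walk)."""
    seen, queue = set(), deque([node])
    while queue:
        for child in flows.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# Impact analysis: what is affected if this warehouse extract changes?
print(sorted(downstream("enrolment_extract.xlsx")))
# ['capacity_report', 'dept_database', 'retention_report']
```

Even this toy traversal shows why lineage matters: changing one spreadsheet extract silently affects a departmental database and two reports, none of which the central team may know about.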
Perhaps most importantly, lineage provides a bird’s-eye view of how data evolves over time, not just where it is stored.
Not sure where your institution falls on the static-to-live data governance spectrum? Register for the webinar to access the free "Where are you today? Static vs Active Governance" worksheet and find out.
From Shadow Systems to Living Governance
Shadow systems are not going away. Universities are complex, decentralized environments, and people will always create tools to solve immediate problems. With the rise of AI, more people will be empowered to create custom tools and applications that interact with institutional data.
The goal should not be to eliminate shadow systems entirely. Instead, institutions should aim to understand where they exist, reduce unnecessary risk, and support the most valuable ones properly.
Data lineage and modern data governance approaches can help institutions move toward what I sometimes describe as “living governance” - an evolving understanding of how data flows, transforms, and supports decisions across the institution.
When done well, governance stops being a static PDF document and becomes something much more useful: a living map of the institutional data ecosystem.
Webinar: Modern Data Governance is Live
Date: April 2nd
Time: 10:00am (PT)
Presented by: Andrew Drinkwater
Register here: Teams Webinar
If you're a data governance committee member, data steward, Registrar, IT leader, Dean, or manager interested in taking a modern approach to data governance, this webinar is for you.
All registrants will receive a free Active Data Governance Self-Assessment tool and Identity Lifecycle Mapping worksheet. By working through the assessment and worksheet (alongside reading this series), you'll come away with an understanding of where your institution sits on the static-to-live governance scale, what gaps exist in your process maps, and where shadow datasets and processes might exist. These sheets not only help you understand where your challenges are, but can give you the launch pad to take your concerns back to your governance or leadership team.
In the webinar, Andrew will cover how static data dictionaries and handbooks decline in accuracy over time, and how live metadata stays accurate over the long term and makes your institution action-ready. We'll talk about efficiency, lineage mapping, why visually mapping data flows matters, and how data governance is key to scaling up your information technology and institutional research infrastructure. We'll show practical examples of how metadata management is used by IR, IT, HR, and/or Registration offices. Andrew will also take questions from attendees at the end of the webinar.
The webinar will be recorded, so you can still catch it if you're unable to attend live.