After writing about how data loses its origin, a question inevitably follows. If provenance matters so much, what does it actually look like in practice?

What Is Provenance?

Provenance is not a technical term invented by data scientists. It comes from much older disciplines, from archives, museums, libraries, and art history. Long before the internet, provenance meant one simple thing: knowing where something comes from.

In a museum context, provenance answers questions like who created an object, when it was made, where it has been, and how it changed hands over time. In archives, it preserves the relationship between a document and the context in which it was created. Provenance is what allows us to distinguish an original from a copy, a record from an interpretation, a fact from a later embellishment.

In digital systems, provenance serves the same function, even if we often forget to name it.

When provenance is intact, data carries its own history. It tells us not only what it is, but how it came to be. When provenance is lost, data becomes an orphan. It still exists, it can still be copied, processed, and displayed, but it no longer knows where it belongs. This distinction matters more than it first appears.

A photograph with provenance is evidence. The same photograph without provenance is decoration. A genealogy entry with sources is a hypothesis that can be tested. The same entry without sources is a claim that invites repetition. Provenance does not guarantee correctness, but it makes correction possible.

The internet, however, was not built with provenance in mind. It was built for distribution. Files are detached from folders. Images are stripped of metadata. Text is quoted without citation. Context is treated as overhead, something that slows things down.

Over time, this has trained us to treat data as self-explanatory. If it looks right, if it fits, if it appears often enough, we assume it must be true. Provenance becomes invisible precisely when we need it most.

Artificial intelligence inherits this environment. It does not see sources, only patterns. It does not know origin, only frequency. When provenance is missing, AI cannot compensate for it. It simply amplifies whatever remains.

Understanding provenance, then, is not about adding bureaucracy to data. It is about recognizing that information without history is unstable. It can be reshaped endlessly, confidently, and wrongly.

With that in mind, the question is no longer whether provenance is desirable. The question is how it can be practiced, deliberately and consistently, in systems that were never designed to preserve it. That is where practice begins.

What Provenance Actually Looks Like in Practice

The uncomfortable answer is that provenance is not a technology problem. It is not something that can be solved by adding a field, a checkbox, or a clever new standard. Provenance is a set of habits. And habits are cultural before they are technical.

Most failures of provenance do not happen because people intend to mislead. They happen because context is treated as optional. Because uncertainty feels untidy. Because filling in gaps is more satisfying than leaving them visible.

In practice, provenance begins with restraint.

One of the most important habits is knowing when not to add information. Not every photograph needs a precise date. Not every genealogy entry needs a profession. Not every archival record needs a complete narrative. “Unknown” is not a failure state. It is an honest one.

This sounds trivial, but it cuts against how most systems are designed. Databases encourage completeness. Platforms reward filled fields. Software nudges users to resolve ambiguity. Over time, absence starts to feel like error, even when it is the most accurate representation available.

Provenance means resisting that pressure.

Another practical habit is keeping sources attached to data as long as possible. Not in a separate document. Not in a comment that can be stripped away. As part of the object itself. A photograph without its source is not neutral. It is already halfway to becoming misleading.

This applies just as much to personal collections as to institutional archives. A filename that encodes nothing. A scan uploaded without context. A family tree entry copied without citation. Each small shortcut increases the chance that the next person will treat the fragment as self-explanatory.
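To make "part of the object itself" concrete, here is a minimal sketch for a scanned photograph: writing the source directly into the file's own EXIF metadata, so it travels with the image. It assumes the Pillow library in Python; the filenames and the source text are invented for illustration.

```python
# A minimal sketch: embed provenance in the image file itself via EXIF.
# Assumes Pillow (pip install Pillow); filenames and text are hypothetical.
from PIL import Image

SOURCE = "Scan of print, family album of A. Janssen, ca. 1952; digitized 2024-03-10"

img = Image.open("scan_0041.jpg")
exif = img.getexif()

# Standard TIFF/EXIF tags: 0x010E ImageDescription, 0x013B Artist.
exif[0x010E] = SOURCE
exif[0x013B] = "Photographer unknown"  # recorded explicitly, not omitted

img.save("scan_0041_with_provenance.jpg", exif=exif)
```

Because the source lives inside the JPEG, it survives copying and renaming in a way a sidecar note or a spreadsheet row does not, though platforms that rewrite images can still strip it.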

Good provenance does not assume good faith downstream. It anticipates reuse.

Equally important is preserving uncertainty explicitly. Phrases like "probably," "possibly," or "family tradition says" are not weaknesses. They are signals. They tell future readers where evidence ends and interpretation begins. Removing them to make a story cleaner does not make it truer. It makes it fragile.
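One way to honor this in a data model is to store the qualifier and the source alongside the value instead of discarding them, and to let "unknown" be a real state rather than an empty string. The sketch below is one possible shape, not a standard; all field names and example content are illustrative.

```python
# A sketch of keeping uncertainty and sources attached to each claim.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Assertion:
    value: Optional[str]   # None means genuinely unknown, not blank
    qualifier: str         # e.g. "recorded", "probably", "family tradition says"
    source: Optional[str]  # citation kept with the claim, not in a separate file

birth_place = Assertion(
    value="Rotterdam",
    qualifier="probably",
    source="Baptism register 1843, entry 17 (place of baptism, not birth)",
)
profession = Assertion(value=None, qualifier="unknown", source=None)

def render(a: Assertion) -> str:
    """Render the claim without stripping its qualifier."""
    text = a.value if a.value is not None else "unknown"
    return f"{a.qualifier}: {text}" + (f" [{a.source}]" if a.source else "")
```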

This is especially relevant for historical material. The farther back we go, the thinner the evidence becomes. Treating speculation as fact does not strengthen history. It erases the very process by which history is understood.

Deletion is another underrated tool.

In a culture that equates accumulation with value, deleting data feels wrong. But propagating unverifiable information is worse. If something cannot be sourced, cannot be contextualized, and cannot be marked as uncertain, removing it is often the most responsible choice.

This is true for family trees, image collections, and even code comments. Provenance is not just about preserving more. It is about preserving less, more carefully.

At an institutional level, provenance requires a shift in responsibility. Digitization is not neutral. When archives put material online, they are not just preserving it. They are enabling its reuse, reinterpretation, and absorption into other systems, including AI models.

That changes the ethical landscape. An error in a local archive once stayed local. An error in a digitized archive becomes global. Institutions can no longer treat attribution as a footnote. It is part of the core record.

This does not mean institutions must guarantee correctness. That is impossible. But it does mean they must surface uncertainty, document provenance clearly, and make corrections visible rather than silent.

Silently fixing a record may improve accuracy, but it destroys traceability. Good provenance includes the history of change.
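In practice, that can be as simple as an append-only revision history on each record: a correction adds an entry instead of overwriting the old value. A minimal sketch, with hypothetical field names and example data:

```python
# A sketch of visible rather than silent correction: superseded values stay.
from dataclasses import dataclass, field

@dataclass
class Revision:
    old_value: str
    reason: str
    date: str

@dataclass
class Record:
    field_name: str
    current: str
    history: list[Revision] = field(default_factory=list)

    def correct(self, new_value: str, reason: str, date: str) -> None:
        # Keep the superseded value; never overwrite in place.
        self.history.append(Revision(self.current, reason, date))
        self.current = new_value

entry = Record("death_year", "1921")
entry.correct("1912", "transcription error; checked against civil register", "2024-05-02")
# entry.history still shows "1921", why it was replaced, and when.
```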

For AI systems, the implications are even sharper. Training data is not raw material. It is inherited history. Every missing source, every collapsed uncertainty, every repeated error becomes part of the model’s internalized world.

Practicing provenance here means being selective about training data, weighting primary sources more heavily than derived ones, and preserving metadata wherever possible. It also means acknowledging what cannot be known. An AI system that expresses uncertainty is not weaker. It is more honest.
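As a rough illustration of what "selective" might mean mechanically: exclude records with no source at all, and weight the rest by how close they sit to the original evidence. The record layout, categories, and weights below are assumptions for illustration, not a recipe.

```python
# A sketch of provenance-aware data selection: drop unsourced records,
# weight primary evidence above derived material. All values are invented.
RECORDS = [
    {"text": "baptism entry, transcribed", "source": "parish register scan", "kind": "primary"},
    {"text": "biographical summary", "source": "user-submitted family tree", "kind": "derived"},
    {"text": "date of birth, origin unclear", "source": None, "kind": None},
]

WEIGHTS = {"primary": 1.0, "derived": 0.3}

def provenance_weighted(records):
    """Yield (record, weight) pairs, excluding anything unsourced."""
    for r in records:
        if r["source"] is None:
            continue  # unverifiable: exclude rather than guess
        yield r, WEIGHTS.get(r["kind"], 0.1)

for record, weight in provenance_weighted(RECORDS):
    print(f'{record["source"]}: weight {weight}')
```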

Ultimately, provenance in practice is about slowing down in a culture optimized for speed.

It is about accepting that not knowing everything is preferable to knowing the wrong thing confidently. It is about treating context as part of the data, not an accessory. And it is about recognizing that trust is not generated by scale or polish, but by care.

None of this is glamorous. It does not produce neat outputs or viral results. But it does something more important. It preserves the connection between information and reality.

Without that connection, data may still circulate, but meaning will not survive. And once meaning is gone, no amount of intelligence, artificial or otherwise, can bring it back.