[Proposal] AI Act Documentation Requirements

Background

Motivation

Translating the legal mandates of the EU AI Act into technical workflows requires a shift from manual paperwork to digital governance. Mapping documentation requirements to formal semantic concepts gives organizations a structured, machine-interpretable framework in which legal text is expressed as metadata artefacts. This enables automation (such as continuous compliance tracking, algorithmic auditing, and real-time policy enforcement across the AI lifecycle) as well as interoperability (through a unified, vendor-agnostic vocabulary that allows diverse development pipelines, enterprise tools, and regulatory platforms to exchange data without losing context).

Inquiry

Research Questions

RQ1 What informational elements must providers of high-risk AI systems, GPAI models, and GPAI models with systemic risk compile to comply with the EU AI Act?

RQ2 Which of these informational elements are already adequately covered by existing documentation formats?

RQ3 How can informational elements that are not sufficiently covered best be represented in a machine-readable manner?

Output

Contributions

1 List of core documentation requirements for High-Risk AI system and GPAI model providers in the EU AI Act

2 Gap analysis of coverage of AI Act documentation requirements by six common existing documentation formats

3 Machine-readable concepts extending DPV for documentation gaps

Background

Previous Work

JRC Paper[1]

The central previous work in this direction is a paper by researchers at the European Commission's Joint Research Centre, which conducts a similar coverage analysis. That work has several limitations. First, it was published in 2023, before the final version of the AI Act was adopted, and therefore lacks, among other things, the provisions for GPAI models. Second, it considered only clusters of documentation elements, without a detailed analysis of each atomic requirement. Third, it focused only on non-semantic documentation formats.

Datasheets[2]

"Datasheets for Datasets" is a foundational research paper that proposes a standardized framework for documenting the motivation, composition, collection process, and intended use of machine learning datasets. Inspired by practices in the electronics industry, this approach aims to increase transparency and accountability in AI by helping practitioners uncover potential biases and determine a dataset's suitability for specific applications.

Model Cards[3]

Model Cards are a foundational framework that proposes short, standardized documents to accompany trained machine learning models, detailing their performance, intended use cases, and operational limitations. This practice enhances AI transparency and accountability by encouraging developers to benchmark models across diverse demographics and explicitly disclose potential biases or ethical risks before deployment.

ISO 42001[4]

ISO/IEC 42001 is the first international standard that specifies requirements for establishing, implementing, and continuously improving an Artificial Intelligence Management System (AIMS). It provides an auditable framework for organizations to govern their AI initiatives responsibly, specifically addressing challenges like algorithmic bias, system transparency, and continuous machine learning lifecycle management.

Croissant[5]

Croissant is an open, standardized metadata format developed by MLCommons that provides a machine-readable vocabulary for describing machine learning datasets. By bridging dataset documentation with major ML frameworks (like PyTorch, TensorFlow, and JAX), it enables seamless data loading, improved search discoverability, and standardized tracking of Responsible AI (RAI) metadata such as data provenance and licensing.

MLDCAT-AP[6]

MLDCAT-AP (Machine Learning DCAT Application Profile) is an extension of DCAT-AP, the European application profile of the W3C DCAT standard, designed specifically for cataloging and sharing machine learning resources. It provides a standardized metadata schema to describe ML models, tasks, algorithms, and training datasets together, facilitating discovery, automated harvesting, and cross-platform interoperability across AI repositories like OpenML and Hugging Face.

DPV[7]

The Data Privacy Vocabulary (DPV) is a standardized, machine-readable ontology and taxonomy developed by a W3C Community Group to describe the processing of personal and non-personal data. It provides a structured data model to document metadata such as data categories, processing purposes, legal bases, safety measures, and technical risks, making it easier for systems to ensure and demonstrate compliance with regulations like the GDPR and the EU AI Act.

Coverage

Scope

AI Act

Documentation Formats

Process

Methodology

Step 01

Vocabulary Requirements Specification
Requirements regarding documentation of high-risk AI systems are extracted from the AI Act.

Step 02

Gap Analysis
Extracted requirements are compared against existing machine-readable and non-machine-readable documentation resources.

Step 03

Concept Creation
Concepts are created in an iterative process to cover the identified gaps broadly while fitting adequately into existing DPV structures.

Step 04

Vocabulary Publication
Concepts are proposed to the DPVCG for publication.

Result

Documentation Requirements

285 documentation requirements identified

Documentation requirements have been grouped by whether they are technical or organisational in nature, and by their level according to the taxonomy of AI transparency in ISO/IEC 12792.

By type: Technical (190), Organisational, Both

By level (ISO/IEC 12792): Context, System (151), Model, Dataset
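The two groupings can be pictured as two labels attached to each requirement record. The following is a minimal sketch under stated assumptions: the `Requirement` class, the paraphrased texts and the kind/level labels assigned to the four sample records are hypothetical and are not taken from the actual list of 285 requirements.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    TECHNICAL = "Technical"
    ORGANISATIONAL = "Organisational"
    BOTH = "Both"


class Level(Enum):
    CONTEXT = "Context"
    SYSTEM = "System"
    MODEL = "Model"
    DATASET = "Dataset"


@dataclass(frozen=True)
class Requirement:
    source: str  # AI Act provision the requirement is drawn from
    text: str    # short paraphrase (illustrative only)
    kind: Kind
    level: Level


# Hypothetical sample records; the kind and level labels are illustrative guesses.
requirements = [
    Requirement("Art. 10(2)(e)", "Assess availability, quantity and suitability of data",
                Kind.TECHNICAL, Level.DATASET),
    Requirement("Annex XI Section 1(1)(d)", "State the number of parameters",
                Kind.TECHNICAL, Level.MODEL),
    Requirement("Art. 17(1)(j)", "Handle communication with authorities",
                Kind.ORGANISATIONAL, Level.CONTEXT),
    Requirement("Annex IV(1)(g)", "Describe the user interface",
                Kind.TECHNICAL, Level.SYSTEM),
]

# Group the records along each dimension independently.
by_kind = Counter(r.kind for r in requirements)
by_level = Counter(r.level for r in requirements)

for kind in Kind:
    print(f"{kind.value}: {by_kind[kind]}")
for level in Level:
    print(f"{level.value}: {by_level[level]}")
```

Keeping type and level as separate enumerations means a requirement can be counted in both breakdowns without duplicating the record.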

Result

Coverage by Documentation Format

Documentation requirements have been analysed regarding their coverage by different existing documentation formats.

For each documentation requirement and each documentation format, the coverage has been assessed qualitatively on a discrete scale from 0 to 2, where each score represents the following:

Score 0: Requirement not covered by the format

Score 1: Requirement partially covered or implicitly addressable

Score 2: Requirement fully or explicitly covered
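As a rough illustration of how such scores could be stored and tallied, the sketch below uses an integer enumeration for the three levels. The requirement identifiers and the scores assigned to them are hypothetical, and treating every score below 2 as a gap is an assumption made for the example rather than a rule stated in the analysis.

```python
from collections import Counter
from enum import IntEnum


class Coverage(IntEnum):
    NOT_COVERED = 0  # requirement not covered by the format
    PARTIAL = 1      # partially covered or implicitly addressable
    FULL = 2         # fully or explicitly covered


# Hypothetical scores: requirement id -> {format name: Coverage}
scores = {
    "REQ-001": {"Model Cards": Coverage.PARTIAL, "DPV": Coverage.FULL},
    "REQ-002": {"Model Cards": Coverage.NOT_COVERED, "DPV": Coverage.PARTIAL},
    "REQ-003": {"Model Cards": Coverage.NOT_COVERED, "DPV": Coverage.NOT_COVERED},
}


def tally(scores, fmt):
    """Count how many requirements received each score for one format."""
    counts = Counter(per_format[fmt] for per_format in scores.values())
    return {level: counts.get(level, 0) for level in Coverage}


def gaps(scores, fmt):
    """Requirements the format does not fully cover (assumed: score below 2)."""
    return sorted(req for req, per_format in scores.items()
                  if per_format[fmt] < Coverage.FULL)


print(tally(scores, "DPV"))
print(gaps(scores, "DPV"))
```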

The following table shows how many documentation requirements received a score of 0, 1 and 2 for each documentation format:

Format	Score 0	Score 1	Score 2	Avg. Score
Datasheets	225	12	48	0.37
Model Cards	162	100	23	0.51
ISO 42001	94	79	112	1.06
Croissant	248	10	27	0.22
MLDCAT-AP	208	63	14	0.31
DPV	101	107	77	0.91

This shows that none of the existing documentation formats fully covers the documentation required by the AI Act. ISO 42001 comes closest, but it is not a machine-readable resource. Among the semantic formats (MLDCAT-AP, DPV, Croissant), DPV scores highest and is therefore selected as the framework for developing concepts to address this gap.
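The average score follows directly from the three counts per format: with n0, n1 and n2 requirements at scores 0, 1 and 2, the mean is (n1 + 2·n2) / 285. A minimal sketch that recomputes it from the table is given below; the variable names are illustrative. Truncating to two decimals reproduces the Avg. Score column (rounding would instead give 0.38, 0.32 and 0.92 for Datasheets, MLDCAT-AP and DPV).

```python
import math

# Counts of requirements scoring 0, 1 and 2, copied from the table above.
COUNTS = {
    "Datasheets": (225, 12, 48),
    "Model Cards": (162, 100, 23),
    "ISO 42001": (94, 79, 112),
    "Croissant": (248, 10, 27),
    "MLDCAT-AP": (208, 63, 14),
    "DPV": (101, 107, 77),
}
SEMANTIC = ("MLDCAT-AP", "DPV", "Croissant")


def mean_score(counts):
    """Weighted mean of the scores 0, 1 and 2 given their frequencies."""
    n0, n1, n2 = counts
    return (0 * n0 + 1 * n1 + 2 * n2) / (n0 + n1 + n2)


means = {fmt: mean_score(c) for fmt, c in COUNTS.items()}

# Truncate (not round) to two decimals, as the table appears to do.
truncated = {fmt: math.floor(m * 100) / 100 for fmt, m in means.items()}

best_overall = max(means, key=means.get)
best_semantic = max(SEMANTIC, key=means.get)

for fmt, m in sorted(means.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{fmt:12s} {truncated[fmt]:.2f}")
print("best overall:", best_overall)
print("best semantic:", best_semantic)
```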

Result

New Concepts Proposed to DPV

43 total concepts proposed

Classes: 33

Properties: 10

dpv: Data Privacy Vocabulary core (15 classes)

Classes

dpv:DataQualityStatus
Definition: The outcome of a data quality assessment representing its suitability or acceptability for use or consideration in a process or for a specific context.
Parent: dpv:Status
Property: hasDataQualityStatus
Instances: dpv:DataQualityAcceptable, dpv:DataQualityUnacceptable

dpv:DataAvailabilityAssessment (Art. 10(2)(e))
Definition: Assessing the availability of data necessary for a given use case.
Parent: dpv:DataQualityAssessment
Object of: dpv:hasAssessment
Source: EU AI Act, Art. 10(2)(e)

dpv:DataAvailabilityStatus (Art. 10(2)(e))
Definition: The outcome of a data availability assessment.
Parent: dpv:DataQualityStatus
Instances: DataAvailabilityCompletelyAvailable, DataAvailabilityPartiallyAvailable, DataAvailabilityNotAvailable, DataAvailabilityUnknown
Source: EU AI Act, Art. 10(2)(e)

dpv:DataQuantityAssessment (Art. 10(2)(e))
Definition: Assessing whether the quantity of data available is sufficient for a given use case.
Parent: dpv:DataQualityAssessment
Object of: dpv:hasAssessment
Source: EU AI Act, Art. 10(2)(e)

dpv:DataQuantityStatus (Art. 10(2)(e))
Definition: The outcome of a data quantity assessment.
Parent: dpv:DataQualityStatus
Instances: DataQuantitySufficient, DataQuantityNotSufficient, DataQuantityUnknown
Source: EU AI Act, Art. 10(2)(e)

dpv:DataSuitabilityAssessment (Art. 10(2)(e))
Definition: Assessing whether the data is suitable for a given use case, i.e. does the data hold the appropriate distribution for the foreseen application.
Parent: dpv:DataQualityAssessment
Object of: dpv:hasAssessment
Source: EU AI Act, Art. 10(2)(e)

dpv:DataSuitabilityStatus (Art. 10(2)(e))
Definition: The outcome of a data suitability assessment.
Parent: dpv:DataQualityStatus
Instances: DataSuitabilityCompletelySuitable, DataSuitabilityPartiallySuitable, DataSuitabilityNotSuitable, DataSuitabilityUnknown
Source: EU AI Act, Art. 10(2)(e)

dpv:DataRelevanceAssessment (Art. 10(3))
Definition: Assessing whether the data is relevant for a given use case.
Parent: dpv:DataQualityAssessment
Object of: dpv:hasAssessment
Source: EU AI Act, Art. 10(3)

dpv:DataRelevanceStatus (Art. 10(3))
Definition: The outcome of a data relevance assessment.
Parent: dpv:DataQualityStatus
Instances: DataRelevantCompletely, DataRelevantPartially, DataNotRelevant, DataRelevanceUnknown
Source: EU AI Act, Art. 10(3)

dpv:DataContextualSuitabilityAssessment (Art. 10(4))
Definition: Assessing whether the data is suitable for the geographical, contextual, behavioural or functional context of a given use case, i.e. does the data hold the appropriate distribution or representativeness for the foreseen application, for example as regards the persons or groups of persons in relation to whom the technology is intended to be used.
Parent: dpv:DataQualityAssessment
Object of: dpv:hasAssessment
Source: EU AI Act, Art. 10(4)

dpv:DataContextualSuitabilityStatus (Art. 10(4))
Definition: The outcome of a data contextual suitability assessment.
Parent: dpv:DataQualityStatus
Instances: DataContextuallySuitable, DataContextuallyNotSuitable, DataContextualSuitabilityUnknown
Source: EU AI Act, Art. 10(4)

dpv:DataCorrectnessAssessment (Art. 10(3))
Definition: Assessing whether the data is free of errors.
Parent: dpv:DataQualityAssessment
Object of: dpv:hasAssessment
Source: EU AI Act, Art. 10(3)

dpv:DataCorrectnessStatus (Art. 10(3))
Definition: The outcome of a data correctness assessment.
Parent: dpv:DataQualityStatus
Instances: DataCorrectCompletely, DataCorrectPartially, DataNotCorrect, DataCorrectnessUnknown
Source: EU AI Act, Art. 10(3)

dpv:DataCompletenessAssessment (Art. 10(3))
Definition: Assessing whether the data is complete.
Parent: dpv:DataQualityAssessment
Object of: dpv:hasAssessment
Source: EU AI Act, Art. 10(3)

dpv:DataCompletenessStatus (Art. 10(3))
Definition: The outcome of a data completeness assessment.
Parent: dpv:DataQualityStatus
Instances: DataComplete, DataNotComplete, DataCompletenessUnknown
Source: EU AI Act, Art. 10(3)
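The parent links in the cards above form a small taxonomy: each specific status specialises dpv:DataQualityStatus, which in turn specialises dpv:Status. A minimal sketch of walking that hierarchy is shown below. It covers only a subset of the concepts, the helper names are hypothetical, and the dpv: prefix on the instance names is an assumption, since most cards list instances without a prefix.

```python
# Parent links copied from the concept cards above (subset).
PARENT = {
    "dpv:DataQualityStatus": "dpv:Status",
    "dpv:DataAvailabilityStatus": "dpv:DataQualityStatus",
    "dpv:DataCompletenessStatus": "dpv:DataQualityStatus",
    "dpv:DataAvailabilityAssessment": "dpv:DataQualityAssessment",
    "dpv:DataCompletenessAssessment": "dpv:DataQualityAssessment",
}

# Instances and the class that lists them; the dpv: prefix here is assumed.
INSTANCE_OF = {
    "dpv:DataAvailabilityPartiallyAvailable": "dpv:DataAvailabilityStatus",
    "dpv:DataNotComplete": "dpv:DataCompletenessStatus",
}


def ancestors(concept):
    """Follow parent links upwards and return the chain of broader concepts."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


def is_a(instance, cls):
    """True if the instance belongs to cls directly or via a broader concept."""
    direct = INSTANCE_OF[instance]
    return cls == direct or cls in ancestors(direct)


print(ancestors("dpv:DataAvailabilityStatus"))
print(is_a("dpv:DataNotComplete", "dpv:DataQualityStatus"))
```

With this structure, a tool that only understands the generic dpv:DataQualityStatus could still interpret any of the more specific outcomes.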

eu-aiact: EU AI Act extension (8 classes)

Classes

eu-aiact:TrainingForDeployer (Art. 9(5)(c))
Definition: Provision of training to deployers.
Parent: dpv:OrganisationalMeasure
Object of: dpv:hasOrganisationalMeasure
Source: EU AI Act, Art. 9(5)(c)

eu-aiact:EnsuringDeployerSuitability (Art. 9(5))
Definition: Consideration of the technical knowledge, experience, education and training to be expected by the deployer.
Parent: dpv:OrganisationalMeasure
Source: EU AI Act, Art. 9(5)

eu-aiact:DataGapsPreventingCompliance (Art. 10(2)(h))
Definition: Identification of relevant data gaps or shortcomings that prevent compliance with the AI Act.
Parent: risk:LegalComplianceRisk
Source: EU AI Act, Art. 10(2)(h)

eu-aiact:CommunicationWithAuthority (Art. 17(1)(j))
Definition: Communication with relevant (including national competent) authorities.
Parent: dpv:GovernanceProcedures
Source: EU AI Act, Art. 17(1)(j)

eu-aiact:HumanOversightManagement (Art. 14(4))
Definition: Provision of an AI system to the deployer such that persons assigned human oversight can understand, interpret, disregard, override or reverse outputs; duly monitor the system; be aware of (over-)reliance; and intervene in the operation of the AI system.
Parent: dpv:OrganisationalMeasure
Source: EU AI Act, Art. 14(4)

eu-aiact:A10-5 (Art. 10(5))
Definition: Legal basis permitting the exceptional processing of special categories of personal data for bias detection/correction, to the extent it is strictly necessary.
Parent: dpv:LegalBasis
Source: EU AI Act, Art. 10(5)

eu-aiact:SpecialCategoryBiasExemptionAssessment (Art. 10(5))
Parent: dpv:Assessment
Source: EU AI Act, Art. 10(5)

eu-aiact:SpecialCategoryBiasExemptionStatus (Art. 10(5))
Instances: SpecialCategoryBiasExemptionPermitted, SpecialCategoryBiasExemptionNotPermitted, SpecialCategoryBiasExemptionUnknown
Source: EU AI Act, Art. 10(5)

ai: AI extension (5 classes · 1 property)

Classes

ai:ModelTesting
Definition: Method to evaluate AI models.
Parent: dpv:TechnicalMeasure
Object of: dpv:hasTechnicalMeasure

ai:AdversarialModelTesting (Annex XI §2(2))
Definition: Method to evaluate (generative) AI models by intentionally providing malicious inputs to identify vulnerabilities.
Parent: ai:ModelTesting
Object of: dpv:hasTechnicalMeasure
Source: EU AI Act, Annex XI Section 2(2)

ai:ParameterCount (Annex XI §1(1)(d); Annex XII(1)(f))
Definition: The number of parameters in an AI model.
Object of: ai:hasParameterCount
Source: EU AI Act, Annex XI Section 1(1)(d); Annex XII(1)(f)

ai:DataCuration (Annex XI §1(2)(c))
Definition: Organisation and integration of data collected from various sources.
Parent: dpv:DataGovernance
Object of: dpv:hasTechnicalOrganisationalMeasure
Source: EU AI Act, Annex XI Section 1(2)(c)

ai:DataSelection (Annex XI §1(2)(c))
Definition: The selection of the dataset for a given purpose.
Parent: ai:DataOperation
Source: EU AI Act, Annex XI Section 1(2)(c)

Properties

ai:hasParameterCount
Domain: ai:Model
Range: ai:ParameterCount
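To show what the domain and range declarations imply for documentation data, the sketch below checks a few statements against them. It treats domain and range as closed-world constraints, in the manner of a validation shape, rather than as RDFS inference rules. The ex: identifiers, the type assignments and the `violations` helper are all hypothetical.

```python
# Declarations copied from the property card above.
DOMAIN = {"ai:hasParameterCount": "ai:Model"}
RANGE = {"ai:hasParameterCount": "ai:ParameterCount"}

# Hypothetical instance data; the ex: identifiers are invented for illustration.
TYPES = {
    "ex:myModel": "ai:Model",
    "ex:myModelParams": "ai:ParameterCount",
    "ex:myDataset": "dpv:Data",
}
TRIPLES = [
    ("ex:myModel", "ai:hasParameterCount", "ex:myModelParams"),
    ("ex:myDataset", "ai:hasParameterCount", "ex:myModelParams"),
]


def violations(triples):
    """Return (triple, reason) pairs for subjects or objects of the wrong type."""
    found = []
    for s, p, o in triples:
        if p in DOMAIN and TYPES.get(s) != DOMAIN[p]:
            found.append(((s, p, o), f"subject type is not {DOMAIN[p]}"))
        if p in RANGE and TYPES.get(o) != RANGE[p]:
            found.append(((s, p, o), f"object type is not {RANGE[p]}"))
    return found


for triple, reason in violations(TRIPLES):
    print(triple, "->", reason)
```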

tech: Technology extension (4 classes · 9 properties)

Classes

tech:Firmware (Annex IV(1)(c))
Definition: Specialised software integrated into hardware devices to control their basic functions and facilitate communication between hardware and higher-level software.
Parent: dpv:Technology
Object of: tech:hasFirmware
Source: EU AI Act, Annex IV(1)(c)

tech:UpdateRequirements (Annex IV(1)(c))
Definition: Requirements to be met to ensure successful updates to newer software versions.
Parent: tech:Instructions
Object of: tech:hasUpdateRequirements
Source: EU AI Act, Annex IV(1)(c)

tech:UserInterface (Annex IV(1)(g))
Definition: System that enables interaction between users and machines.
Object of: tech:hasUserInterface
Source: EU AI Act, Annex IV(1)(g)

tech:EnergyConsumption (Annex XI §1(2)(e))
Definition: The energy consumed by an AI system, potentially in a particular stage of its lifecycle.
Parent: eu-aiact:ComputationalResource
Source: EU AI Act, Annex XI Section 1(2)(e)

Properties

tech:hasFirmware (Annex IV(1)(c))
Range: tech:Firmware
Source: EU AI Act, Annex IV(1)(c)

tech:hasVersion
Range: Literal

tech:hasUpdateRequirements (Annex IV(1)(c))
Range: tech:UpdateRequirements
Source: EU AI Act, Annex IV(1)(c)

tech:hasUserInterface (Annex IV(1)(g))
Range: tech:UserInterface
Source: EU AI Act, Annex IV(1)(g)

tech:hasKnownEnergyConsumption (Annex XI §1(2)(e))
Definition: The energy known to be consumed by a system.
Range: tech:EnergyConsumption
Source: EU AI Act, Annex XI Section 1(2)(e)

tech:hasEstimatedEnergyConsumption (Annex XI §1(2)(e))
Definition: The energy estimated to be consumed by a system.
Range: tech:EnergyConsumption
Source: EU AI Act, Annex XI Section 1(2)(e)

tech:hasModality
Range: tech:Content

tech:hasOperatingInteraction (Annex IV(1)(b))
Range: dpv:Technology
Source: EU AI Act, Annex IV(1)(b)

tech:hasExpectedLifetime (Art. 13(3)(e))
Domain: dpv:Technology
Range: dpv:Duration
Source: EU AI Act, Art. 13(3)(e)
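One way to picture these properties in use is a JSON-LD style description of a system. The sketch below is illustrative only: the `describe_system` helper, the ex: identifiers and the version string are hypothetical, and the namespace IRIs are assumed from DPV's w3id pattern and should be checked against the published vocabulary.

```python
import json

# Namespace IRIs are assumptions; verify them against the published vocabulary.
CONTEXT = {
    "dpv": "https://w3id.org/dpv#",
    "tech": "https://w3id.org/dpv/tech#",
    "ex": "https://example.com/",
}


def describe_system(system_id, version, firmware_id, lifetime_id):
    """Build a JSON-LD style description using three of the tech: properties."""
    return {
        "@context": CONTEXT,
        "@id": system_id,
        "@type": "dpv:Technology",  # domain of tech:hasExpectedLifetime
        "tech:hasVersion": version,  # range: Literal
        "tech:hasFirmware": {"@id": firmware_id, "@type": "tech:Firmware"},
        "tech:hasExpectedLifetime": {"@id": lifetime_id, "@type": "dpv:Duration"},
    }


# Hypothetical identifiers and values, for illustration only.
doc = describe_system("ex:system-1", "2.3.0", "ex:firmware-1", "ex:lifetime-5-years")
print(json.dumps(doc, indent=2))
```

Typing each object node with the class named as the property's range keeps the record checkable against the declarations in the cards.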

justifications: Justifications extension (1 class)

Classes

justifications:Assumption (Art. 10(2)(d))
Definition: Assumptions that clarify or provide more information about the context.
Parent: dpv:Justification
Usage note: When used regarding data, expresses information that data is supposed to measure and represent.
Source: EU AI Act, Art. 10(2)(d)

Bibliography

References

[1] Hupont, I., Micheli, M., Delipetrev, B., Gómez, E., Garrido, J.S. Documenting High-Risk AI: A European Regulatory Perspective. Computer, 56(5):18–27. IEEE, 2023. doi:10.1109/MC.2023.3235712
[2] Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J.W., Wallach, H., Daumé III, H., Crawford, K. Datasheets for Datasets. Communications of the ACM, 64(12):86–92, 2021. doi:10.1145/3458723
[3] Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I.D., Gebru, T. Model Cards for Model Reporting. In: Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* '19). ACM, 2019. doi:10.1145/3287560.3287596
[4] ISO/IEC. ISO/IEC 42001:2023, Information Technology: Artificial Intelligence Management System. International Organization for Standardization, Geneva, 2023. iso.org/standard/42001
[5] Akhtar, M., Benjelloun, O., Conforti, C., et al. Croissant: A Metadata Format for ML-Ready Datasets. In: Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Datasets and Benchmarks Track, 2024. arXiv:2403.19546
[6] SEMIC / Interoperable Europe. MLDCAT-AP: Machine Learning DCAT Application Profile, v3.0.0. European Commission Semantic Interoperability Community (SEMIC), 2025. semiceu.github.io/MLDCAT-AP
[7] Pandit, H.J., Esteves, B., Krog, G.P., Ryan, P., Golpayegani, D., Flake, J. Data Privacy Vocabulary (DPV) – Version 2.0. In: The Semantic Web – ISWC 2024. Lecture Notes in Computer Science, vol 15233. Springer, 2024. doi:10.1007/978-3-031-77847-6_10
[8] European Parliament and Council of the EU. Regulation (EU) 2024/1689 of 13 June 2024 laying down harmonised rules on Artificial Intelligence (AI Act). Official Journal of the European Union, L 2024/1689, 2024. EUR-Lex