Research Write-up

[Proposal] AI Act Documentation Requirements

Fabian Linde
Harshvardhan J. Pandit

This research is funded by the European Union's Horizon 2020 programme under the Marie Skłodowska-Curie grant agreement No. 101169409 (HARNESS).

Note: This is a proposal to the Data Privacy Vocabularies and Controls Community Group (DPVCG).

Motivation

Translating the legal mandates of the EU AI Act into technical workflows requires a shift from manual paperwork to digital governance. Mapping documentation requirements to formal semantic concepts gives organizations a structured, machine-interpretable framework in which legal text becomes metadata artefacts. This enables automation, such as continuous compliance tracking, algorithmic auditing, and real-time policy enforcement across the AI lifecycle. It also enables interoperability: a unified, vendor-agnostic vocabulary lets diverse development pipelines, enterprise tools, and regulatory platforms exchange data without losing context.

Research Questions

RQ1 What informational elements must providers of high-risk AI systems, GPAI models, and GPAI models with systemic risk compile to comply with the EU AI Act?
RQ2 Which of these informational elements are already adequately covered by existing documentation formats?
RQ3 How can informational elements that are not sufficiently covered best be represented in a machine-readable manner?

Contributions

1. List of core documentation requirements for high-risk AI system and GPAI model providers in the EU AI Act
2. Gap analysis of the coverage of AI Act documentation requirements by six common existing documentation formats
3. Machine-readable concepts extending DPV to address documentation gaps

Previous Work

The central previous work in this direction is a paper by researchers at the European Commission's Joint Research Centre, which conducted a similar coverage analysis. That work has several limitations. First, it was published in 2023, before the final version of the AI Act was adopted, and therefore lacks, among other things, the provisions on GPAI models. Second, it considered only clusters of documentation elements, without a detailed analysis of each atomic requirement. Third, it focused only on non-semantic documentation formats.
"Datasheets for Datasets" is a foundational research paper that proposes a standardized framework for documenting the motivation, composition, collection process, and intended use of machine learning datasets. Inspired by practices in the electronics industry, this approach aims to increase transparency and accountability in AI by helping practitioners uncover potential biases and determine a dataset's suitability for specific applications.
Model Cards are a foundational framework that proposes short, standardized documents to accompany trained machine learning models, detailing their performance, intended use cases, and operational limitations. This practice enhances AI transparency and accountability by encouraging developers to benchmark models across diverse demographics and explicitly disclose potential biases or ethical risks before deployment.
ISO/IEC 42001 is the first international standard that specifies requirements for establishing, implementing, and continuously improving an Artificial Intelligence Management System (AIMS). It provides an auditable framework for organizations to govern their AI initiatives responsibly, specifically addressing challenges like algorithmic bias, system transparency, and continuous machine learning lifecycle management.
Croissant is an open, standardized metadata format developed by MLCommons that provides a machine-readable vocabulary for describing machine learning datasets. By bridging dataset documentation with major ML frameworks (like PyTorch, TensorFlow, and JAX), it enables seamless data loading, improved search discoverability, and standardized tracking of Responsible AI (RAI) metadata such as data provenance and licensing.
MLDCAT-AP (Machine Learning DCAT Application Profile) is an extension of DCAT-AP, the European application profile of the W3C DCAT standard, designed specifically for cataloguing and sharing machine learning resources. It provides a standardized metadata schema to describe ML models, tasks, algorithms, and training datasets together, facilitating discovery, automated harvesting, and cross-platform interoperability across AI repositories like OpenML and Hugging Face.
The Data Privacy Vocabulary (DPV) is a standardized, machine-readable ontology and taxonomy developed by a W3C Community Group to describe the processing of personal and non-personal data. It provides a structured data model to document metadata such as data categories, processing purposes, legal bases, safety measures, and technical risks, making it easier for systems to ensure and demonstrate compliance with regulations like the GDPR and the EU AI Act.

Scope

AI Act

Art. 9, Art. 10, Art. 12, Art. 13, Art. 14, Art. 15, Art. 17, Annex IV, Annex V, Annex XI, Annex XII

Documentation Formats

Datasheets, Model Cards, ISO 42001, Croissant, MLDCAT-AP, DPV

Methodology

Step 01

Vocabulary Requirements Specification
Requirements regarding documentation of high-risk AI systems are extracted from the AI Act.

Step 02

Gap Analysis
Extracted requirements are compared against existing machine-readable and non-machine-readable documentation resources.

Step 03

Concept Creation
Concepts are created in an iterative process that aims for broad coverage of the gaps while ensuring an adequate fit into existing DPV structures.

Step 04

Vocabulary Publication
Concepts are proposed to the DPVCG for publication.
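
Step 02 can be pictured as a filter over scored requirements: every requirement that a chosen format does not fully cover is a candidate gap for Step 03. The following is a minimal sketch under stated assumptions, not the authors' tooling. The `Requirement` class, the `find_gaps` helper, the identifiers `REQ-A` to `REQ-C`, and all scores are invented for illustration and do not reflect the actual analysis.

```python
from dataclasses import dataclass


@dataclass
class Requirement:
    """One atomic documentation requirement extracted in Step 01."""
    req_id: str             # hypothetical identifier
    scores: dict[str, int]  # coverage score (0, 1 or 2) per documentation format


def find_gaps(requirements: list[Requirement], fmt: str) -> list[Requirement]:
    """Return the requirements that `fmt` does not fully cover (score below 2)."""
    return [r for r in requirements if r.scores.get(fmt, 0) < 2]


# Placeholder records: identifiers and scores are made up for this sketch.
requirements = [
    Requirement("REQ-A", {"DPV": 1, "Croissant": 0}),
    Requirement("REQ-B", {"DPV": 0, "Croissant": 0}),
    Requirement("REQ-C", {"DPV": 2, "Croissant": 0}),
]

gaps = find_gaps(requirements, "DPV")
print([r.req_id for r in gaps])  # ['REQ-A', 'REQ-B']
```

A format that is missing from a requirement's score dictionary is treated as not covering it, which keeps the filter conservative.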

Documentation Requirements

285 documentation requirements identified

Documentation requirements have been grouped regarding whether they are of a technical or organisational nature, as well as regarding their level based on the taxonomy of transparency in AI in ISO 12792.

By type
Technical: 190
Organisational: 83
Both: 12

By level (ISO 12792)
Context: 16
System: 151
Model: 22
Dataset: 96
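
Both groupings partition the same set of 285 requirements, which can be checked directly. The sketch below only restates the counts given above; the variable and function names are chosen here for illustration.

```python
# Counts copied from the breakdown above; names are illustrative.
TOTAL_REQUIREMENTS = 285

by_type = {"Technical": 190, "Organisational": 83, "Both": 12}
by_level = {"Context": 16, "System": 151, "Model": 22, "Dataset": 96}


def shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each category's share of the total, in percent (one decimal)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# Each grouping should account for every requirement exactly once.
assert sum(by_type.values()) == TOTAL_REQUIREMENTS
assert sum(by_level.values()) == TOTAL_REQUIREMENTS

print(shares(by_type))
print(shares(by_level))
```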

Coverage by Documentation Format

Documentation requirements have been analysed regarding their coverage by different existing documentation formats.

For each documentation requirement and each documentation format, the coverage has been assessed qualitatively on a discrete scale from 0 to 2, where each score represents the following:

Score 0: Requirement not covered by the format
Score 1: Requirement partially covered or implicitly addressable
Score 2: Requirement fully or explicitly covered

The following table shows how many documentation requirements received a score of 0, 1 and 2 for each documentation format:

Format        Score 0   Score 1   Score 2   Avg. Score
Datasheets        225        12        48         0.37
Model Cards       162       100        23         0.51
ISO 42001          94        79       112         1.06
Croissant         248        10        27         0.22
MLDCAT-AP         208        63        14         0.31
DPV               101       107        77         0.91

The table shows that none of the existing documentation formats fully covers the documentation required by the AI Act. ISO 42001 comes closest but is not a machine-readable resource. Among the semantic formats (MLDCAT-AP, DPV, Croissant), DPV scores highest and is therefore selected as the framework for developing concepts to address this gap.
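
The averages and the choice of DPV can be reproduced from the score counts: the average is the mean score over all 285 requirements, (0·n0 + 1·n1 + 2·n2) / (n0 + n1 + n2). The sketch below is an illustration with names chosen here. The reported values are consistent with truncation to two decimals rather than rounding (for example, Datasheets: 108/285 ≈ 0.379, reported as 0.37); that reading is inferred from the numbers and is not stated in the write-up.

```python
import math

# (score 0, score 1, score 2) counts per format, copied from the table above.
coverage = {
    "Datasheets": (225, 12, 48),
    "Model Cards": (162, 100, 23),
    "ISO 42001": (94, 79, 112),
    "Croissant": (248, 10, 27),
    "MLDCAT-AP": (208, 63, 14),
    "DPV": (101, 107, 77),
}


def average_score(counts: tuple[int, int, int]) -> float:
    """Mean coverage score: each requirement contributes its score (0, 1 or 2)."""
    total = sum(counts)
    weighted = sum(score * n for score, n in enumerate(counts))
    return weighted / total


def truncate(value: float, digits: int = 2) -> float:
    """Cut off (rather than round) after the given number of decimals."""
    factor = 10 ** digits
    return math.floor(value * factor) / factor


for name, counts in coverage.items():
    print(f"{name:12s} {truncate(average_score(counts)):.2f}")

# Semantic formats as named in the text; the best one is the extension target.
semantic_formats = ["MLDCAT-AP", "DPV", "Croissant"]
best_semantic = max(semantic_formats, key=lambda f: average_score(coverage[f]))
print(best_semantic)  # DPV
```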

New Concepts Proposed to DPV

43 concepts proposed in total
Classes: 33
Properties: 10

dpv: Data Privacy Vocabulary core · 15 classes

Classes

dpv:DataQualityStatus
Definition: The outcome of a data quality assessment representing its suitability or acceptability for use or consideration in a process or for a specific context.
Parent: dpv:Status
Property: hasDataQualityStatus
Instances: dpv:DataQualityAcceptable, dpv:DataQualityUnacceptable

dpv:DataAvailabilityAssessment · Art. 10(2)(e)
Definition: Assessing the availability of data necessary for a given use case.
Parent: dpv:DataQualityAssessment
Object of: dpv:hasAssessment
Source: EU AI Act, Art. 10(2)(e)

dpv:DataAvailabilityStatus · Art. 10(2)(e)
Definition: The outcome of a data availability assessment.
Parent: dpv:DataQualityStatus
Instances: DataAvailabilityCompletelyAvailable, DataAvailabilityPartiallyAvailable, DataAvailabilityNotAvailable, DataAvailabilityUnknown
Source: EU AI Act, Art. 10(2)(e)

dpv:DataQuantityAssessment · Art. 10(2)(e)
Definition: Assessing whether the quantity of data available is sufficient for a given use case.
Parent: dpv:DataQualityAssessment
Object of: dpv:hasAssessment
Source: EU AI Act, Art. 10(2)(e)

dpv:DataQuantityStatus · Art. 10(2)(e)
Definition: The outcome of a data quantity assessment.
Parent: dpv:DataQualityStatus
Instances: DataQuantitySufficient, DataQuantityNotSufficient, DataQuantityUnknown
Source: EU AI Act, Art. 10(2)(e)

dpv:DataSuitabilityAssessment · Art. 10(2)(e)
Definition: Assessing whether the data is suitable for a given use case, i.e. does the data hold the appropriate distribution for the foreseen application.
Parent: dpv:DataQualityAssessment
Object of: dpv:hasAssessment
Source: EU AI Act, Art. 10(2)(e)

dpv:DataSuitabilityStatus · Art. 10(2)(e)
Definition: The outcome of a data suitability assessment.
Parent: dpv:DataQualityStatus
Instances: DataSuitabilityCompletelySuitable, DataSuitabilityPartiallySuitable, DataSuitabilityNotSuitable, DataSuitabilityUnknown
Source: EU AI Act, Art. 10(2)(e)

dpv:DataRelevanceAssessment · Art. 10(3)
Definition: Assessing whether the data is relevant for a given use case.
Parent: dpv:DataQualityAssessment
Object of: dpv:hasAssessment
Source: EU AI Act, Art. 10(3)

dpv:DataRelevanceStatus · Art. 10(3)
Definition: The outcome of a data relevance assessment.
Parent: dpv:DataQualityStatus
Instances: DataRelevantCompletely, DataRelevantPartially, DataNotRelevant, DataRelevanceUnknown
Source: EU AI Act, Art. 10(3)

dpv:DataContextualSuitabilityAssessment · Art. 10(4)
Definition: Assessing whether the data is suitable for the geographical, contextual, behavioural or functional context of a given use case, i.e. does the data hold the appropriate distribution or representativeness for the foreseen application, for example as regards the persons or groups of persons in relation to whom the technology is intended to be used.
Parent: dpv:DataQualityAssessment
Object of: dpv:hasAssessment
Source: EU AI Act, Art. 10(4)

dpv:DataContextualSuitabilityStatus · Art. 10(4)
Definition: The outcome of a data contextual suitability assessment.
Parent: dpv:DataQualityStatus
Instances: DataContextuallySuitable, DataContextuallyNotSuitable, DataContextualSuitabilityUnknown
Source: EU AI Act, Art. 10(4)

dpv:DataCorrectnessAssessment · Art. 10(3)
Definition: Assessing whether the data is free of errors.
Parent: dpv:DataQualityAssessment
Object of: dpv:hasAssessment
Source: EU AI Act, Art. 10(3)

dpv:DataCorrectnessStatus · Art. 10(3)
Definition: The outcome of a data correctness assessment.
Parent: dpv:DataQualityStatus
Instances: DataCorrectCompletely, DataCorrectPartially, DataNotCorrect, DataCorrectnessUnknown
Source: EU AI Act, Art. 10(3)

dpv:DataCompletenessAssessment · Art. 10(3)
Definition: Assessing whether the data is complete.
Parent: dpv:DataQualityAssessment
Object of: dpv:hasAssessment
Source: EU AI Act, Art. 10(3)

dpv:DataCompletenessStatus · Art. 10(3)
Definition: The outcome of a data completeness assessment.
Parent: dpv:DataQualityStatus
Instances: DataComplete, DataNotComplete, DataCompletenessUnknown
Source: EU AI Act, Art. 10(3)
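
To illustrate how the assessment and status concepts above might be combined, the following sketch builds a JSON-LD node that links a dataset to one assessment and its outcome, and rejects a status that does not belong to that assessment. It is a sketch under assumptions, not a published DPV pattern: the concepts are proposals, the `dpv:` prefix on `hasDataQualityStatus` and its attachment to the assessment node are assumed, the dataset IRI is hypothetical, and only three of the seven assessment classes are included to keep the example short.

```python
import json

# DPV namespace; the proposed terms are assumed to be placed in it.
DPV = "https://w3id.org/dpv#"

# Allowed status instances per assessment class, copied from the listing above.
STATUSES = {
    "DataAvailabilityAssessment": [
        "DataAvailabilityCompletelyAvailable",
        "DataAvailabilityPartiallyAvailable",
        "DataAvailabilityNotAvailable",
        "DataAvailabilityUnknown",
    ],
    "DataQuantityAssessment": [
        "DataQuantitySufficient",
        "DataQuantityNotSufficient",
        "DataQuantityUnknown",
    ],
    "DataCompletenessAssessment": [
        "DataComplete",
        "DataNotComplete",
        "DataCompletenessUnknown",
    ],
}


def assessment_node(dataset_iri: str, assessment: str, status: str) -> dict:
    """Build a JSON-LD node linking a dataset to an assessment and its outcome."""
    if status not in STATUSES.get(assessment, []):
        raise ValueError(f"{status!r} is not a status of {assessment}")
    return {
        "@context": {"dpv": DPV},
        "@id": dataset_iri,
        "dpv:hasAssessment": {
            "@type": f"dpv:{assessment}",
            # Assumption: the outcome is attached to the assessment itself.
            "dpv:hasDataQualityStatus": {"@id": f"dpv:{status}"},
        },
    }


node = assessment_node(
    "https://example.com/datasets/training-set",  # hypothetical dataset IRI
    "DataQuantityAssessment",
    "DataQuantitySufficient",
)
print(json.dumps(node, indent=2))
```

Keeping the allowed statuses in one mapping means a mismatched pair, such as a completeness status on a quantity assessment, fails early instead of producing a misleading record.
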

eu-aiact: EU AI Act extension · 8 classes

Classes

eu-aiact:TrainingForDeployer · Art. 9(5)(c)
Definition: Provision of training to deployers.
Parent: dpv:OrganisationalMeasure
Object of: dpv:hasOrganisationalMeasure
Source: EU AI Act, Art. 9(5)(c)

eu-aiact:EnsuringDeployerSuitability · Art. 9(5)
Definition: Consideration of the technical knowledge, experience, education and training to be expected by the deployer.
Parent: dpv:OrganisationalMeasure
Source: EU AI Act, Art. 9(5)

eu-aiact:DataGapsPreventingCompliance · Art. 10(2)(h)
Definition: Identification of relevant data gaps or shortcomings that prevent compliance with the AI Act.
Parent: risk:LegalComplianceRisk
Source: EU AI Act, Art. 10(2)(h)

eu-aiact:CommunicationWithAuthority · Art. 17(1)(j)
Definition: Communication with relevant (including national competent) authorities.
Parent: dpv:GovernanceProcedures
Source: EU AI Act, Art. 17(1)(j)

eu-aiact:HumanOversightManagement · Art. 14(4)
Definition: Provision of an AI system to the deployer such that persons assigned human oversight can understand, interpret, disregard, override or reverse outputs; duly monitor the system; be aware of (over-)reliance; and intervene in the operation of the AI system.
Parent: dpv:OrganisationalMeasure
Source: EU AI Act, Art. 14(4)

eu-aiact:A10-5 · Art. 10(5)
Definition: Legal basis permitting the exceptional processing of special categories of personal data for bias detection/correction, to the extent it is strictly necessary.
Parent: dpv:LegalBasis
Source: EU AI Act, Art. 10(5)

eu-aiact:SpecialCategoryBiasExemptionAssessment · Art. 10(5)
Parent: dpv:Assessment
Source: EU AI Act, Art. 10(5)

eu-aiact:SpecialCategoryBiasExemptionStatus · Art. 10(5)
Instances: SpecialCategoryBiasExemptionPermitted, SpecialCategoryBiasExemptionNotPermitted, SpecialCategoryBiasExemptionUnknown
Source: EU AI Act, Art. 10(5)

ai: AI extension · 5 classes · 1 property

Classes

ai:ModelTesting
Definition: Method to evaluate AI models.
Parent: dpv:TechnicalMeasure
Object of: dpv:hasTechnicalMeasure

ai:AdversarialModelTesting · Annex XI §2(2)
Definition: Method to evaluate (generative) AI models by intentionally providing malicious inputs to identify vulnerabilities.
Parent: ai:ModelTesting
Object of: dpv:hasTechnicalMeasure
Source: EU AI Act, Annex XI Section 2(2)

ai:ParameterCount · Annex XI §1(1)(d); Annex XII(1)(f)
Definition: The number of parameters in an AI model.
Object of: ai:hasParameterCount
Source: EU AI Act, Annex XI Section 1(1)(d); Annex XII(1)(f)

ai:DataCuration · Annex XI §1(2)(c)
Definition: Organisation and integration of data collected from various sources.
Parent: dpv:DataGovernance
Object of: dpv:hasTechnicalOrganisationalMeasure
Source: EU AI Act, Annex XI Section 1(2)(c)

ai:DataSelection · Annex XI §1(2)(c)
Definition: The selection of the dataset for a given purpose.
Parent: ai:DataOperation
Source: EU AI Act, Annex XI Section 1(2)(c)

Properties

ai:hasParameterCount
Domain: ai:Model
Range: ai:ParameterCount

tech: Technology extension · 4 classes · 9 properties

Classes

tech:Firmware · Annex IV(1)(c)
Definition: Specialised software integrated into hardware devices to control their basic functions and facilitate communication between hardware and higher-level software.
Parent: dpv:Technology
Object of: tech:hasFirmware
Source: EU AI Act, Annex IV(1)(c)

tech:UpdateRequirements · Annex IV(1)(c)
Definition: Requirements to be met to ensure successful updates to newer software versions.
Parent: tech:Instructions
Object of: tech:hasUpdateRequirements
Source: EU AI Act, Annex IV(1)(c)

tech:UserInterface · Annex IV(1)(g)
Definition: System that enables interaction between users and machines.
Object of: tech:hasUserInterface
Source: EU AI Act, Annex IV(1)(g)

tech:EnergyConsumption · Annex XI §1(2)(e)
Definition: The energy consumed by an AI system, potentially in a particular stage of its lifecycle.
Parent: eu-aiact:ComputationalResource
Source: EU AI Act, Annex XI Section 1(2)(e)

Properties

tech:hasFirmware · Annex IV(1)(c)
Range: tech:Firmware
Source: EU AI Act, Annex IV(1)(c)

tech:hasVersion
Range: Literal

tech:hasUpdateRequirements · Annex IV(1)(c)
Range: tech:UpdateRequirements
Source: EU AI Act, Annex IV(1)(c)

tech:hasUserInterface · Annex IV(1)(g)
Range: tech:UserInterface
Source: EU AI Act, Annex IV(1)(g)

tech:hasKnownEnergyConsumption · Annex XI §1(2)(e)
Definition: The energy known to be consumed by a system.
Range: tech:EnergyConsumption
Source: EU AI Act, Annex XI Section 1(2)(e)

tech:hasEstimatedEnergyConsumption · Annex XI §1(2)(e)
Definition: The energy estimated to be consumed by a system.
Range: tech:EnergyConsumption
Source: EU AI Act, Annex XI Section 1(2)(e)

tech:hasModality
Range: tech:Content

tech:hasOperatingInteraction · Annex IV(1)(b)
Range: dpv:Technology
Source: EU AI Act, Annex IV(1)(b)

tech:hasExpectedLifetime · Art. 13(3)(e)
Domain: dpv:Technology
Range: dpv:Duration
Source: EU AI Act, Art. 13(3)(e)
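
The declared ranges of the tech properties above lend themselves to simple automated checks. The sketch below treats each range as a closed-world validation rule, which is a simplification: in RDFS, domain and range drive inference rather than validation, and subclass reasoning is left out here. The `range_violations` helper and the example description are hypothetical.

```python
# Ranges copied from the property listing above; "Literal" stands for a plain value.
RANGES = {
    "tech:hasFirmware": "tech:Firmware",
    "tech:hasVersion": "Literal",
    "tech:hasUpdateRequirements": "tech:UpdateRequirements",
    "tech:hasUserInterface": "tech:UserInterface",
    "tech:hasKnownEnergyConsumption": "tech:EnergyConsumption",
    "tech:hasEstimatedEnergyConsumption": "tech:EnergyConsumption",
    "tech:hasModality": "tech:Content",
    "tech:hasOperatingInteraction": "dpv:Technology",
    "tech:hasExpectedLifetime": "dpv:Duration",
}


def range_violations(statements):
    """Yield (property, object type) pairs whose object type differs from the range.

    `statements` is an iterable of (property, object_type) pairs. Properties
    without a declared range are skipped. Subclasses of the range are not
    recognised, so they are reported as violations too.
    """
    for prop, obj_type in statements:
        expected = RANGES.get(prop)
        if expected is not None and obj_type != expected:
            yield prop, obj_type


# Hypothetical description of one system, reduced to (property, object type) pairs.
description = [
    ("tech:hasFirmware", "tech:Firmware"),
    ("tech:hasExpectedLifetime", "Literal"),  # wrong: the range is dpv:Duration
    ("tech:hasVersion", "Literal"),
]

print(list(range_violations(description)))
# [('tech:hasExpectedLifetime', 'Literal')]
```
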

justifications: Justifications extension · 1 class

Classes

justifications:Assumption · Art. 10(2)(d)
Definition: Assumptions that clarify or provide more information about the context.
Parent: dpv:Justification
Usage note: When used regarding data, expresses information that data is supposed to measure and represent.
Source: EU AI Act, Art. 10(2)(d)

References

  1. Hupont, I., Micheli, M., Delipetrev, B., Gómez, E., Garrido, J.S. Documenting High-Risk AI: A European Regulatory Perspective. Computer, 56(5):18–27. IEEE, 2023. 10.1109/MC.2023.3235712
  2. Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J.W., Wallach, H., Daumé III, H., Crawford, K. Datasheets for Datasets. Communications of the ACM, 64(12):86–92, 2021. 10.1145/3458723
  3. Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I.D., Gebru, T. Model Cards for Model Reporting. In: Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* '19). ACM, 2019. 10.1145/3287560.3287596
  4. ISO/IEC. ISO/IEC 42001:2023 — Information Technology: Artificial Intelligence Management System. International Organization for Standardization, Geneva, 2023. iso.org/standard/42001
  5. Akhtar, M., Benjelloun, O., Conforti, C., et al. Croissant: A Metadata Format for ML-Ready Datasets. In: Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Datasets and Benchmarks Track, 2024. arXiv:2403.19546
  6. SEMIC / Interoperable Europe. MLDCAT-AP: Machine Learning DCAT Application Profile, v3.0.0. European Commission Semantic Interoperability Community (SEMIC), 2025. semiceu.github.io/MLDCAT-AP
  7. Pandit, H.J., Esteves, B., Krog, G.P., Ryan, P., Golpayegani, D., Flake, J. Data Privacy Vocabulary (DPV) — Version 2.0. In: The Semantic Web — ISWC 2024. Lecture Notes in Computer Science, vol 15233. Springer, 2024. 10.1007/978-3-031-77847-6_10
  8. European Parliament and Council of the EU. Regulation (EU) 2024/1689 of 13 June 2024 laying down harmonised rules on Artificial Intelligence (AI Act). Official Journal of the European Union, L 2024/1689, 2024. EUR-Lex