Key Takeaways
- August 2, 2026 is the EU AI Act enforcement deadline for high-risk AI systems — Article 10 data governance requirements are not optional
- Article 10 has 7 distinct technical requirements: relevance, representativeness, error-freedom, completeness, statistical properties, bias examination, and provenance documentation
- Documentation must be continuous — retrofitting data cards after a model is built rarely satisfies auditors
- GDPR data minimization and Article 10 representativeness create a genuine legal tension; consent-based collection and anonymization are the practical resolution
- For GDPR-compliant, Article 10-ready training data with full documentation packages, see our [speech data collection services](/speech-data/)
August 2, 2026. That is the date when EU AI Act enforcement begins for high-risk AI systems. If you are building automotive driver assistance systems, medical imaging tools, employment screening algorithms, or any other system the Act classifies as high-risk, Article 10 is not an abstract legal concern. It is a set of engineering requirements with a hard deadline.
Big 4 consulting firms are producing excellent white papers explaining what Article 10 means for executives. This article is different. It explains what Article 10 means for the ML engineer who has to actually implement it — what data to collect, how to document it, how to examine it for bias, and what a regulator will look for if they audit you.
No legal jargon. Concrete checklists, templates, and the specific mistakes that cause audit failures.
What Article 10 Actually Requires
Article 10 of the EU AI Act is titled “Data and Data Governance.” It applies to any high-risk AI system as defined in Annex III — which covers a wide range of systems including biometric identification, critical infrastructure management, education and vocational training tools, employment and worker management, access to essential services, law enforcement, migration control, and administration of justice.
The text of Article 10 contains seven core requirements, paraphrased here with their engineering implications:
1. Data must be relevant to the intended purpose (Art. 10(3))
Your training data must correspond to the actual task your system performs in deployment. An automotive NLU system trained primarily on call center transcripts is not using relevant data. You must document the intended purpose and show that your dataset directly supports it — not a tangentially related task.
2. Sufficiently representative (Art. 10(3))
This is where most teams underestimate the requirement. “Representative” does not mean balanced in the naive sense of equal class distribution. It means statistically covering the population the system will be applied to, including edge cases, regional variants, demographic subgroups, and uncommon but operationally critical scenarios.
For a speech recognition system targeting German-speaking Europe, “representative” means covering not just Hochdeutsch but Austrian and Swiss German dialects, regional and non-native accents, and age-related speech patterns, including elderly speakers. For a medical imaging classifier, it means including imaging from different equipment manufacturers, patient populations with different skin tones, and disease presentations across demographic groups.
The technical approach is stratified sampling: defining the strata in advance based on known variance dimensions, then sampling proportionally or oversampling underrepresented subgroups to ensure coverage. Document your strata definition, your target proportions, and your achieved proportions.
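A minimal sketch of that documentation step in Python (the strata labels and the helper name are illustrative, not anything the Act prescribes):

```python
from collections import Counter

def stratum_report(samples, strata_targets):
    """Compare achieved stratum proportions against documented targets.

    samples: list of stratum labels, one per collected item.
    strata_targets: dict mapping stratum label -> target proportion.
    Returns, per stratum, the target, achieved, and gap values.
    """
    counts = Counter(samples)
    total = len(samples)
    report = {}
    for stratum, target in strata_targets.items():
        achieved = counts.get(stratum, 0) / total if total else 0.0
        report[stratum] = {
            "target": target,
            "achieved": round(achieved, 4),
            "gap": round(achieved - target, 4),
        }
    return report

# Example: dialect strata for a German-speaking-Europe speech corpus
targets = {"hochdeutsch": 0.6, "austrian": 0.25, "swiss": 0.15}
collected = ["hochdeutsch"] * 70 + ["austrian"] * 20 + ["swiss"] * 10
print(stratum_report(collected, targets))
```

A report like this, generated at collection close, is exactly the target-vs-achieved evidence an auditor will ask for.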
3. Free from errors to the extent possible, with exceptions documented (Art. 10(3))
The regulation recognizes that perfect data does not exist. What it requires is that you have systematic processes to detect and remove errors, that you document the error rate of your dataset, and that where errors remain (because removal would harm representativeness), you document why.
Practically: implement inter-annotator agreement (IAA) measurement during annotation, set quality thresholds for annotation acceptance, and produce a final dataset quality report with your measured error rate and methodology.
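For categorical labels, Cohen's kappa is the standard IAA metric. A hand-rolled sketch (scikit-learn's `cohen_kappa_score` computes the same quantity; the labels here are illustrative):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement rate
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (freq_a[c] / n) * (freq_b[c] / n) for c in set(labels_a) | set(labels_b)
    )
    return (observed - expected) / (1 - expected)

a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
b = ["pos", "pos", "neg", "pos", "pos", "neg", "pos", "neg"]
kappa = cohens_kappa(a, b)
print(f"kappa = {kappa:.3f}")  # compare against your documented threshold
```

The threshold itself (0.8 is a common choice) is a project decision; what Article 10 requires is that it is documented and applied consistently.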
4. Complete — all relevant features and characteristics documented (Art. 10(2)(c))
Every preprocessing decision — normalization, filtering, augmentation, resampling — must be logged and documented. “We cleaned the data” is not sufficient. Auditors want to see version-controlled, step-by-step records of every transformation applied between raw collection and final training set.
5. Appropriate statistical properties — size, variety, and distribution (Art. 10(3))
This requirement pushes back against the common practice of collecting the minimum viable dataset. You must document the statistical reasoning behind your dataset size, demonstrate that you have sufficient samples per stratum to support the statistical inferences the model is expected to make, and analyze the distribution properties of your data.
Sample size calculations with confidence intervals are the appropriate evidence here. If you cannot explain why your dataset is large enough to support your task’s requirements, you cannot satisfy this requirement.
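One defensible way to justify a per-stratum size is the standard proportion-estimate formula n = z^2 * p * (1 - p) / e^2. A sketch, with defaults that are illustrative rather than regulatory thresholds:

```python
import math

def min_sample_size(confidence_z=1.96, margin=0.05, p=0.5):
    """Minimum n to estimate a proportion p within +/- margin at the given
    confidence level (z = 1.96 for 95%). p = 0.5 is the worst case."""
    return math.ceil(confidence_z**2 * p * (1 - p) / margin**2)

# Worst-case proportion, 95% confidence, +/- 5 percentage points
print(min_sample_size())
# A rare subgroup estimated at ~10% prevalence needs fewer samples per point
print(min_sample_size(p=0.1))
```

Whatever method you use, the documented artifact is the same: target n per stratum, the confidence parameters, and the assumptions behind them.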
6. Examined for biases, including with respect to protected characteristics (Art. 10(2)(f))
This is not a post-hoc review. Article 10 requires that you proactively examine your data for biases related to characteristics that are protected under EU law: age, sex, gender, racial or ethnic origin, disability, sexual orientation, religion. You must document your examination methodology, the results (including biases found), and what mitigations were applied.
Where biases cannot be fully mitigated, you must document why they remain and what residual risk they represent.
7. Data governance documentation — origin, purpose, collection methodology (Art. 10(2))
The provenance chain from raw source to training set must be documented. Who collected the data, under what legal basis, using what methodology, at what dates, in what geography, and with what intermediate transformations. Third-party datasets are not exempt — you are responsible for auditing and documenting their provenance too.
The GDPR Tension
There is a genuine legal tension between GDPR’s data minimization principle (Art. 5(1)(c)) — collect only what you need — and Article 10’s requirement for representative coverage, which may push you to collect more demographic breadth than a minimalist interpretation of GDPR would allow.
The practical resolution: use anonymized or pseudonymized data where possible, use consent-based collection with explicit purpose specification when collecting identifiable demographic data, and document the legal basis for each demographic variable you collect. This is not an unsolvable problem, but it requires intentional design rather than treating the two regulations as separate concerns.
The Engineering Checklist
This is the operational core of Article 10 compliance. Use this as a literal project checklist.
Data Collection Phase
- Define the target population: Who is the AI system going to be applied to? What is the realistic demographic range of users or subjects? Document this in writing before any data collection begins.
- Define stratification variables: Based on the target population, identify which demographic and operational variables require stratified coverage. For speech AI: age brackets, gender, language dialect, accent, recording environment (clean/noisy), speaking style. For medical imaging: imaging modality, equipment manufacturer, patient age, patient skin tone, disease presentation type.
- Calculate sample sizes per stratum: Use standard statistical methods — power analysis for classification tasks, minimum sample size calculations for rare subgroups. Document your target n per stratum, your confidence interval, and the assumptions behind the calculation.
- Document legal basis under GDPR before collection: Choose and document Art. 6(1)(a) (consent), Art. 6(1)(b) (contract performance), Art. 6(1)(e) (public task), or Art. 6(1)(f) (legitimate interest). If collecting special category data under Art. 9 (health data, biometric data), document your Art. 9(2) basis separately.
- Implement consent documentation if using consent basis: Informed consent records with timestamp, data subject ID (anonymized for documentation), consent scope, and withdrawal mechanism.
- Document data sources at collection time: For each batch collected — source identity (collection partner or internal), collection method, collection date range, geographic location, recording conditions, equipment used.
- Design and implement PII handling: Define what PII will be present, how it will be anonymized before annotation, and the timeline for anonymization. Annotators should not see identifiable information unless operationally necessary.
- Achieved vs. target demographics report: Before closing the collection phase, produce a report comparing target proportions to achieved proportions per stratum. Document gaps and whether they require additional collection or acceptance with documented limitation.
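One lightweight way to make the "document at collection time" items above concrete is an append-only record per batch. This sketch uses illustrative field names, not a schema mandated by the Act:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class CollectionBatchRecord:
    """One provenance record per collection batch, written at collection
    time. Field names are illustrative, not a mandated schema."""
    batch_id: str
    source: str        # collection partner or "internal"
    method: str        # e.g. "participant recording"
    date_start: str    # ISO dates keep the log tool-agnostic
    date_end: str
    geography: str
    equipment: str
    legal_basis: str   # e.g. "Art. 6(1)(a) consent"
    notes: str = ""

record = CollectionBatchRecord(
    batch_id="batch-2025-07-001",
    source="vendor-A",
    method="participant recording",
    date_start="2025-07-01",
    date_end="2025-07-15",
    geography="AT",
    equipment="Zoom H5, 48 kHz",
    legal_basis="Art. 6(1)(a) consent",
)
# Append-only JSONL provenance log: one line per batch
print(json.dumps(asdict(record)))
```

Records written contemporaneously like this later become the raw material for the provenance chain in the Data Card.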
Annotation and Quality Phase
- Annotation guidelines versioned and stored: Every instruction given to annotators must be versioned and retrievable. Auditors may ask to see the exact guidelines used at the time of annotation.
- Inter-annotator agreement measured: Implement IAA measurement as a systematic process, not a one-off check. Use Cohen’s kappa for categorical annotation, Krippendorff’s alpha for ordinal, or Pearson correlation for continuous. Document your threshold for acceptance.
- Quality review sample: Randomly sample a percentage of completed annotations for expert review. Document the sample size, reviewer role, and pass/fail rate.
- Error rate documented: Produce a final dataset error rate estimate based on IAA and quality review findings. Document methodology.
- Annotation metadata logged: For each annotated item, log the annotator ID (anonymized), annotation timestamp, tool version, and any flags or reviews applied.
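A sketch of the quality-review items above: a reproducible review sample plus an error-rate estimate with a confidence interval. The normal approximation is one reasonable choice, and all numbers are illustrative:

```python
import math
import random

def review_sample(item_ids, fraction=0.05, seed=42):
    """Draw a reproducible random review sample; record the seed and
    fraction in the quality report so the sample can be re-drawn."""
    rng = random.Random(seed)
    k = max(1, round(len(item_ids) * fraction))
    return rng.sample(item_ids, k)

def error_rate_with_ci(n_reviewed, n_failed, z=1.96):
    """Point estimate and normal-approximation 95% CI for the error rate."""
    p = n_failed / n_reviewed
    half = z * math.sqrt(p * (1 - p) / n_reviewed)
    return p, max(0.0, p - half), min(1.0, p + half)

ids = [f"utt-{i:05d}" for i in range(10_000)]
sample = review_sample(ids)                      # 5% drawn for expert review
p, lo, hi = error_rate_with_ci(len(sample), n_failed=14)
print(f"error rate {p:.2%} (95% CI {lo:.2%} to {hi:.2%})")
```

The interval, not just the point estimate, belongs in the Data Card: it is what makes the error-rate claim statistically defensible.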
Data Documentation Phase (Data Card)
- Dataset name and version: Semantic versioning (major.minor.patch) for datasets, not just dates.
- Intended use statement: A one-paragraph description of the specific AI system and use case this dataset was collected for. Include what it should NOT be used for.
- High-risk category: Explicitly state which Annex III category applies to the intended system.
- Collection methodology: Detailed enough that someone could reproduce the collection process. Includes recruiting method, screening criteria, recording protocol, equipment specifications, payment structure.
- Demographic statistics: Distribution tables for all stratification variables. Achieved vs. target comparison. Any gaps with explanation.
- Known limitations: What is NOT in this dataset? What populations, conditions, or scenarios are underrepresented? This is not a weakness to hide — it is a required disclosure.
- Data quality metrics: Error rate (with methodology), IAA scores (with methodology), quality review pass rate, any systematic quality issues found and how they were handled.
- Bias examination results: See bias examination section below.
- Provenance chain: Numbered list from source to training system. See the Provenance Chain block in Template 1 below.
- GDPR documentation pointers: Legal basis, DPA references, retention period, data processor identity, data subject rights mechanism.
Bias Examination Phase
- Define protected characteristics in scope: Based on your AI system’s application and target population, determine which protected characteristics (age, sex, gender, racial/ethnic origin, disability, etc.) are relevant to examine. Document why others are excluded if applicable.
- Run distributional analysis: For each protected characteristic, compute the distribution in your dataset and compare to the target population baseline. Use statistical tests appropriate to the data type — chi-squared for categorical, Kolmogorov-Smirnov for distributional comparison.
- Test for annotation bias: If your dataset includes human annotations, test whether annotators from different demographic groups produced systematically different labels. This is particularly important for subjective tasks like sentiment, toxicity, or quality rating.
- Check for proxy variables: Identify features that correlate with protected characteristics and may serve as proxies in model training. Geographic codes, names, language variety, and audio acoustic features can all correlate with demographic variables.
- Document findings: Every bias found must be documented — what it is, what statistical evidence was used to detect it, what its magnitude is.
- Document mitigations applied: For each identified bias: what mitigation was applied (resampling, augmentation, re-weighting, data collection gap-fill), and what residual bias remains.
- Document unmitigated biases: If a bias exists that was not fully mitigated, document why (e.g., insufficient data available for that subgroup, mitigation would harm representativeness of a different dimension) and what the residual risk is.
- Record examiner identity: Role (not necessarily name), date of examination, and methodology used. The examination must be attributable to a specific role and be repeatable.
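The distributional-analysis step can be as simple as a chi-squared goodness-of-fit test against the population baseline. A sketch with hypothetical age brackets and baseline figures:

```python
from collections import Counter

def chi_squared_gof(observed_counts, baseline_proportions):
    """Chi-squared goodness-of-fit statistic of dataset demographics
    against a target-population baseline (e.g. census figures)."""
    n = sum(observed_counts.values())
    stat = 0.0
    for category, p in baseline_proportions.items():
        expected = n * p
        stat += (observed_counts.get(category, 0) - expected) ** 2 / expected
    return stat

# Age brackets in the dataset vs. a hypothetical population baseline
observed = Counter({"18-30": 410, "31-45": 350, "46-64": 200, "65+": 40})
baseline = {"18-30": 0.25, "31-45": 0.30, "46-64": 0.26, "65+": 0.19}
stat = chi_squared_gof(observed, baseline)
# Critical value for df=3 at alpha=0.05 is 7.815; a larger statistic means
# the age distribution deviates significantly from the baseline.
print(f"chi2 = {stat:.1f}, df = 3")
```

For continuous variables, `scipy.stats.ks_2samp` covers the Kolmogorov-Smirnov comparison mentioned in the checklist; the documented artifact is the same either way: method, baseline source, statistic, and finding.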
Training and Validation Split Documentation
- Document split methodology: Was the split random or stratified? If stratified, which variables were used for stratification? Document the tool or script used.
- Verify test set representativeness: The test set must represent the target population, not just be a random holdout. Run the same demographic distribution analysis on your test set that you ran on the full dataset. Document the comparison.
- Verify validation set isolation: Confirm that no information leakage occurred between training and validation sets (no shared data subjects, no shared recording sessions).
- Version-lock splits: Once splits are established for a training run, they must be immutably version-locked. Auditors need to be able to reproduce the exact split used for a specific model version.
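A sketch of a deterministic, subject-level stratified split whose fingerprint can be version-locked in the Data Card. Splitting at the subject level (here, speakers grouped by a hypothetical dialect attribute) also prevents the shared-subject leakage flagged above:

```python
import hashlib
import random

def stratified_split(subject_ids, key, test_frac=0.1, seed=1234):
    """Deterministic stratified split at the data-subject level, so no
    subject appears in both sets; `key` maps a subject to its stratum."""
    rng = random.Random(seed)
    strata = {}
    for sid in subject_ids:
        strata.setdefault(key(sid), []).append(sid)
    train, test = [], []
    for members in strata.values():
        members = sorted(members)      # order-independent determinism
        rng.shuffle(members)
        cut = round(len(members) * test_frac)
        test.extend(members[:cut])
        train.extend(members[cut:])
    return sorted(train), sorted(test)

def split_fingerprint(train, test):
    """SHA-256 over the ordered IDs; record it in the Data Card so an
    auditor can verify the exact split behind a model version."""
    blob = "|".join(train) + "||" + "|".join(test)
    return hashlib.sha256(blob.encode()).hexdigest()

# Hypothetical speaker pool stratified by dialect
speakers = {f"spk-{i:03d}": ("austrian" if i % 4 == 0 else "hochdeutsch")
            for i in range(100)}
train, test = stratified_split(list(speakers), key=speakers.get, test_frac=0.2)
print(split_fingerprint(train, test))
```

Storing the seed, the script version, and the fingerprint together is what makes the split reproducible on demand.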
Ongoing Compliance Checkpoints
- Data version control system in place: Every dataset version used in any training run must be identifiable and retrievable. DVC, Delta Lake, or equivalent.
- Dataset update procedures documented: When new data is added to a dataset, what review process applies? Does the bias examination need to be re-run? What triggers a version bump?
- Incident response for data quality issues: What happens if a data quality issue is discovered post-training? Who is notified, what review process applies, when is a model retrain required?
- Erasure request handling for training data: If a data subject exercises the Art. 17 GDPR right to erasure, what is the process for removing their records from the dataset? What happens to trained models that may have incorporated their data? Document the policy.
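A sketch of an erasure handler that removes one subject's items and flags any stratum whose representativeness gap now exceeds a documented threshold. The schema, targets, and threshold are all illustrative:

```python
from collections import Counter

def apply_erasure(dataset, subject_id, strata_targets, gap_threshold=0.02):
    """Remove one data subject's items (GDPR Art. 17 erasure) and flag any
    stratum whose representativeness gap now exceeds the documented threshold.
    dataset: list of (subject_id, stratum) tuples, one per sample."""
    remaining = [(s, st) for s, st in dataset if s != subject_id]
    removed = len(dataset) - len(remaining)
    counts = Counter(st for _, st in remaining)
    total = len(remaining)
    flags = []
    for stratum, target in strata_targets.items():
        achieved = counts.get(stratum, 0) / total
        if target - achieved > gap_threshold:
            flags.append(stratum)  # gap widened: trigger a new analysis
    return remaining, removed, flags

# Illustrative dataset: subject "e0" contributed 3 samples in the small 65+ stratum
data = ([(f"a{i}", "18-64") for i in range(50)]
        + [("e0", "65+")] * 3
        + [(f"e{i}", "65+") for i in range(1, 10)])
remaining, removed, flags = apply_erasure(data, "e0", {"18-64": 0.8, "65+": 0.2})
print(removed, flags)  # flagged strata need a fresh representativeness check
```

The flagged strata, and the decision taken in response, are exactly what the documented erasure policy should require you to record.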
Documentation Templates
Template 1: Article 10 Data Card (Minimum Required Fields)
Copy this template and complete it for each dataset used to train or fine-tune a high-risk AI system.
======================================================
ARTICLE 10 DATA CARD
======================================================
DATASET IDENTIFICATION
----------------------
Dataset Name: [descriptive name]
Version: [major.minor.patch]
Date of This Card: [YYYY-MM-DD]
Prepared By: [role, team — not necessarily name]
INTENDED USE
------------
Intended AI System: [specific AI application]
Intended Task: [classification / regression / generation / etc.]
Annex III Category: [e.g., "Annex III, Point 1: Biometrics"; for automotive
                    safety components, note instead that the system is
                    high-risk via Annex I product safety legislation]
Out-of-Scope Uses: [explicitly list what this dataset should NOT be used for]
DATA COLLECTION
---------------
Collection Method: [participant recording / web scraping / existing corpus /
synthetic / mixed — describe in detail]
Collection Period: [YYYY-MM-DD to YYYY-MM-DD]
Geographic Coverage: [list countries or regions]
Languages/Modalities: [list, with dialect information if relevant]
Collection Partner: [internal / vendor name / open source corpus name]
Total Samples: [n after quality filtering]
Excluded Samples: [n excluded, reasons for exclusion]
DEMOGRAPHICS (for person-related data)
---------------------------------------
Age Range: [min – max, median]
Distribution: [bracket breakdown, e.g., "18-30: 22%, 31-45: 35%…"]
Target vs. Achieved: [comparison table or statement]
Gender Distribution: [percentages, note self-reported vs. inferred if applicable]
Target vs. Achieved: [comparison]
Geographic/Regional: [country or region breakdown]
Target vs. Achieved: [comparison]
Other Relevant Variables:
[list additional strata relevant to your application]
KNOWN LIMITATIONS
-----------------
Underrepresented groups: [list]
Excluded conditions/contexts: [list]
Temporal scope limitations: [e.g., "collected 2024-2025; does not reflect
speech patterns that emerge post-2025"]
Other known gaps: [list]
DATA QUALITY
------------
Annotation Type: [label type, task description]
Annotation Tool: [tool name and version]
Annotator Count: [n annotators]
Inter-Annotator Agreement: [metric name, score, methodology]
Quality Review Sample: [n% reviewed, pass rate]
Final Error Rate Estimate: [%, methodology used to estimate]
Known Quality Issues: [list any systematic issues and how handled]
BIAS EXAMINATION
----------------
Examination Date: [YYYY-MM-DD]
Examiner Role: [e.g., "Data Governance Lead"]
Protected Characteristics Examined:
- [characteristic 1]: [method] → [finding] → [mitigation applied]
- [characteristic 2]: [method] → [finding] → [mitigation applied]
Annotation Bias Test: [conducted / not applicable — explain]
Proxy Variable Analysis: [conducted / not applicable — explain]
Unmitigated Biases:
- [If any]: [description, statistical magnitude, reason not mitigated,
residual risk assessment]
GDPR / LEGAL BASIS
------------------
Legal Basis: [Art. 6(1)(a) Consent / Art. 6(1)(f) Legitimate
Interest / other — with justification]
Special Category Basis: [Art. 9(2)(x) if applicable, or "N/A"]
Data Controller: [organization name]
Data Processor (if external): [name, DPA reference]
Retention Period: [duration and policy]
Erasure Mechanism: [how Art. 17 requests are handled for this dataset]
PROVENANCE CHAIN
----------------
Step 1: [Data origin — source, date, legal basis]
Step 2: [Transfer to collection partner — DPA reference if applicable]
Step 3: [Raw data ingestion — date, format, hash/checksum]
Step 4: [Preprocessing — transformations applied, tool, version]
Step 5: [Annotation — tool, guidelines version, date range]
Step 6: [Quality review — date, reviewer role, results]
Step 7: [Final dataset assembly — date, version lock, hash/checksum]
Step 8: [Transfer to training infrastructure — date, access controls]
TRAINING SPLIT
--------------
Split Method: [random / stratified — if stratified, variables used]
Training Set Size: [n]
Validation Set Size: [n]
Test Set Size: [n]
Test Set Representativeness: [summary of demographic distribution analysis]
Split Version Lock: [hash or identifier of immutable split]
VERSION HISTORY
---------------
Version Date Changes
------- ---------- -------
1.0.0 YYYY-MM-DD Initial release
======================================================
Template 2: Bias Examination Report (Minimum Format)
This report documents the bias examination conducted per Article 10(2)(f). It can be a standalone document referenced in the Data Card or embedded within it for smaller datasets.
======================================================
ARTICLE 10 BIAS EXAMINATION REPORT
======================================================
EXAMINATION METADATA
--------------------
Dataset: [name and version]
Examination Date: [YYYY-MM-DD]
Examiner: [role — e.g., "Data Governance Lead, YPAI"]
Scope Statement: This examination was conducted to satisfy the requirements
of EU AI Act Article 10(2)(f) for the above dataset.
PROTECTED CHARACTERISTICS IN SCOPE
------------------------------------
Characteristic | In Scope | Rationale if Excluded
-----------------------|----------|-----------------------------
Age | [Y/N] | [if N: justification]
Sex / Gender | [Y/N] | [if N: justification]
Racial/Ethnic Origin | [Y/N] | [if N: justification]
Disability | [Y/N] | [if N: justification]
Sexual Orientation | [Y/N] | [if N: justification]
Religion | [Y/N] | [if N: justification]
Socioeconomic Status | [Y/N] | [note: not a protected characteristic
| | but relevant for representativeness]
STATISTICAL ANALYSIS
--------------------
For each in-scope characteristic:
[Characteristic: Age]
Analysis Method: [Chi-squared test / distributional comparison / other]
Baseline Reference: [target population source, e.g., Eurostat 2024]
Result: [p-value, distribution comparison]
Finding: [e.g., "Speakers aged 65+ underrepresented: 4.2% in
dataset vs. 18.5% in target population baseline"]
Mitigation Applied: [e.g., "Additional 340 recordings collected for 65+
age group, bringing representation to 14.8%"]
Residual Bias: [e.g., "3.7% gap remains due to recruitment difficulty;
documented as known limitation"]
[Characteristic: Gender]
Analysis Method: [...]
Baseline Reference: [...]
Result: [...]
Finding: [...]
Mitigation Applied: [...]
Residual Bias: [...]
[Repeat for each in-scope characteristic]
ANNOTATION BIAS TEST
---------------------
Method Used: [e.g., "Cross-tabulation of annotator demographic group
vs. label distribution for quality rating task"]
Result: [e.g., "No statistically significant difference detected
across annotator groups (p=0.34 chi-squared)"]
OR
[e.g., "Annotators from Group X rated audio quality 0.3
points lower on average (p=0.02); investigated and
attributed to recording equipment familiarity; mitigation:
calibration session and guideline update"]
PROXY VARIABLE ANALYSIS
------------------------
Variables Examined: [list features examined for demographic correlation]
Correlations Found: [e.g., "Regional accent label correlates with geographic
origin (r=0.71); treated as expected, documented"]
Problematic Proxies: [any features that could serve as unintended proxies
in model training — mitigation steps applied]
SUMMARY
-------
Biases Found: [count and brief description]
Biases Mitigated: [count and brief description]
Residual Biases: [count, description, and risk assessment]
Overall Assessment: [This dataset has been examined for biases in
accordance with EU AI Act Article 10(2)(f). The
examination found [n] bias(es), of which [n] were
mitigated. Residual biases are documented above.]
CERTIFICATION
-------------
Examined by: [Role] on [date]
This report is maintained as part of the technical documentation for the
AI system referenced in the Dataset Identification section above, in
accordance with Article 11 EU AI Act.
======================================================
Common Mistakes That Cause Audit Failures
These are not theoretical — they are patterns that appear repeatedly when organizations try to document compliance retroactively.
1. Confusing “representative” with “balanced”
Balanced means equal numbers across groups. Representative means proportional to the target population. These are almost never the same thing. A speech recognition system for elderly care in Germany should have more speakers aged 70+ than a general-purpose system — because that is the target population. Documenting a 50/50 gender split when the target deployment population is 80% female is not compliance; it is documentation of the wrong thing.
2. Writing the Data Card after the model is trained
Data governance documentation must be contemporaneous with the process it documents. When you write a collection methodology description six months after the data was collected, you are producing a reconstruction, not a record. Auditors know the difference. The methodology document you wrote before collection started is verifiable; the one you wrote afterward is not.
Implement documentation as part of your data pipeline — not as a post-processing task. The Data Card fields should be populated progressively as each phase completes.
3. Skipping bias examination on the validation and test sets
Most teams examine the training set for bias. Fewer examine their validation and test sets with equal rigor. If your test set does not represent the target population — if it over-indexes on easy examples or well-represented subgroups — your performance metrics do not reflect real-world behavior. Article 10 requires that training data practices apply to the data “used for” the system, which regulators interpret as including validation and test data.
4. Treating Article 10 as a one-time check
Article 10 compliance is not a checkbox at dataset creation time. Training data evolves — you add new data, you discover quality issues, data subjects exercise erasure rights. Each change to the dataset potentially affects its representativeness, quality metrics, and bias examination results. Implement a change management process: when does a dataset update require a new bias examination? When does it require a new quality audit? Document the policy.
5. “We scraped the web” as a collection methodology
This is not a methodology description — it is an admission of inadequate documentation. A compliant collection methodology includes: the search strategy and terms used, the sources included and excluded and why, the date range of content collected, the geographic scope, the filtering criteria applied (content type, language, quality filters), the deduplication methodology, and the legal basis for collection from each source type. If you cannot reconstruct what went into your dataset, you cannot satisfy Article 10(2).
6. Not documenting what you did NOT include
Article 10 compliance requires documenting known gaps and limitations. A dataset that is honest about what it does not cover — and why — is a compliant dataset. A dataset with no acknowledged limitations is a dataset whose documentation has not been completed. Auditors are not looking for perfect datasets; they are looking for honest characterization of the dataset actually used.
7. Third-party dataset pass-through
“The dataset came from [vendor/open source project]; their documentation covers compliance.” This does not work under Article 10. You are responsible for the compliance of all data used in your system, regardless of source. You must review third-party datasets against Article 10 requirements, document your review, and conduct your own bias examination. Request documentation from vendors; if they cannot provide it, treat the dataset as undocumented and either document it yourself or exclude it.
How Article 10 Interacts with GDPR
These two regulations operate in the same space and create genuine tensions. Here is the engineering-practical version.
The right-to-erasure problem
Under GDPR Article 17, data subjects can request erasure of their data. If you honor an erasure request and remove a speaker’s recordings from your dataset, your dataset’s representativeness may change — if that speaker was in an underrepresented subgroup, their removal makes the dataset less representative. Document a policy for how you handle this: what is your process for assessing whether an erasure materially affects dataset representativeness, and what is the trigger for conducting a new representativeness analysis?
Consent-based collection creates ongoing obligations
If your legal basis for data collection is consent (Art. 6(1)(a)), data subjects retain the right to withdraw consent at any time. This means your training dataset is not stable — it can shrink. From a practical engineering standpoint: if you are using consent as your legal basis, your data pipeline must support dataset versioning that tracks which samples are affected by withdrawal, and your model retraining process must account for the possibility that the dataset used to train a deployed model differs from the dataset you have available today.
Some organizations choose legitimate interest (Art. 6(1)(f)) specifically to avoid this instability — but legitimate interest for training data collection requires a documented balancing test showing that your interests outweigh the data subjects’ rights, which is not automatic for sensitive or special category data.
Data minimization vs. representativeness
GDPR Art. 5(1)(c) requires collection of only the minimum data necessary. Article 10 requires representative coverage of the target population, which may require collecting broader demographic information than a minimalist view of the task would suggest.
The resolution is not to ignore one or the other but to design data collection with both requirements in mind:
- Collect demographic metadata under a separate, specific legal basis from the task content
- Anonymize demographic identifiers after using them for stratification verification
- Document why each demographic variable is necessary for achieving representativeness
- Avoid collecting demographic data that you have no statistical plan to use
Special category data (racial/ethnic origin, health data, biometric data) requires explicit Art. 9(2) basis regardless of the Art. 6 basis for the main data collection. Design this into your consent architecture from the start.
The anonymous data escape hatch — and its limits
Truly anonymous data (not pseudonymized — genuinely anonymous) falls outside GDPR scope. If you can design your data collection and processing to produce anonymous training data — for example, transcribing speech without retaining the audio, or using aggregated imaging data without patient-level records — you may be able to reduce GDPR complexity while satisfying Article 10.
The catch: anonymization for training data often means you lose the metadata needed to demonstrate representativeness. If you anonymize before completing your demographic analysis and documentation, you may satisfy GDPR but undermine your Article 10 documentation. The sequencing matters: conduct your demographic analysis and produce your Data Card before anonymization, then anonymize before the annotation phase.
Resources and Next Steps
The official Article 10 text is available at EUR-Lex: EU AI Act, Article 10. Recitals 44 through 49 provide interpretive context for the data governance requirements.
The harmonised technical standards supporting Article 10, under development by CEN-CENELEC, are still in draft but will be the definitive interpretive guidance once published. Monitor the AI Office website for publication.
For practical implementation, Datasheets for Datasets (Gebru et al.) and Google’s Data Cards (Pushkarna et al.) provide academic foundations for the documentation frameworks that auditors will recognize and respect.
The August 2, 2026 deadline will not move. The organizations that will have audit-defensible documentation on that date are the ones that started the documentation process during data collection, not after model training.
If you need training data that is already designed for Article 10 compliance — with Data Cards, bias examination reports, stratified demographic coverage, and full provenance documentation as standard deliverables — YPAI’s speech data collection services and GDPR-compliant data programs are built for exactly this requirement. Our automotive AI data programs include Article 10 documentation packages as part of the engagement.
Related YPAI Content
- EU AI Act Article 10: Data Governance — deeper dive into the MLOps pipeline architecture for Article 10 compliance
- EU AI Act high-risk AI training data requirements — which Annex III categories apply and what the data quality standards require in practice
- GDPR-compliant speech data collection in Europe — lawful basis, consent documentation, and vendor checklist for voice data under GDPR
- CTO’s guide to sovereign AI architecture and costs — how EU AI Act compliance fits into the broader sovereign AI infrastructure decision
- EU AI Act compliant training data services
- GDPR-compliant speech data collection
- Automotive AI data solutions
- Speech data technical specifications
Sources:
- EU AI Act Official Text, Article 10 — EUR-Lex
- Datasheets for Datasets — Gebru et al., arXiv:1803.09010
- Data Cards: Purposeful and Transparent Dataset Documentation — Pushkarna et al., FAccT 2022
- EU AI Act Recitals 44–49 (data governance interpretive context)
- Fairlearn: A toolkit for assessing and improving fairness in AI — Microsoft Research
- Great Expectations: Data quality documentation framework