The AGORA (AI GOvernance and Regulatory Archive) dataset is a living collection of AI-relevant laws, regulations, standards, and other governance documents from the United States and around the world. AGORA includes document text, metadata, summaries, and thematic tags to enable deep analysis of the global AI governance landscape.
The dataset is available on Zenodo.
The AGORA dataset powers ETO's AGORA tool.
The AGORA dataset is an original Emerging Technology Observatory resource. Document text and some metadata are taken from official sources. Other metadata, summaries, and tags are produced by ETO analysts and annotators.
This dataset is subject to ETO's general terms of use. If you use it, please cite us.
Note that the dataset includes full text of AGORA documents, taken from sources such as government websites and repositories. Given its nature, we believe all of this material is open to non-commercial use consistent with our general terms of use, but we make no warranties.
No.
Please cite the "Emerging Technology Observatory AGORA dataset," including the link.
If you use AGORA to access the data, you can cite that tool instead.
The AGORA dataset consists of the CSV tables documents, segments, collections, and authorities, plus a folder of text documents called fulltext.
This table includes core metadata, summaries, and thematic tags for AGORA documents.
Column name | Type | Description |
---|---|---|
AGORA ID | Number | A unique numerical identifier for the document. |
Official name | Text | The full, official name of the document according to an authoritative record. If the name is extremely long, it may be truncated. |
Casual name | Text | A colloquial name for the document, as defined in the document itself or as chosen by a screener. |
Link to document | URL | A link to an authoritative record of the document. In virtually all cases, this means the record of the document on the official website of the authority that issued it. |
Authority | Text | The authority that issued the document. Corresponds to a row in authorities. |
Editor's pick | Boolean | TRUE if this record is an AGORA Editor's Pick, meaning our team thought it was especially interesting or important. |
Collections | Text array | A screener assigns each document to one or more collections, listed here. Each collection corresponds to a row in collections. |
Most recent activity | Text | One of the following options, as determined by a screener: "Proposed"; "Enacted" (the document has been officially approved and finalized and is in or will be entering into effect); "Defunct." |
Most recent activity date | Date | The date on which the document attained the status indicated in Most recent activity according to official sources. |
Proposed date | Date | The date on which the document was proposed. (If Most recent activity is Proposed, this will be the same as Most recent activity date.) |
Annotated? | Boolean | TRUE if an annotator finished reviewing and populating fields in this table for the document. |
Validated? | Boolean | TRUE if a validator finished reviewing and resolving issues in fields in this table for the document. |
Primarily applies to the government | Boolean | TRUE if the document primarily applies to the government (e.g., an AI strategy for a government agency, or a directive to a government official to enact a regulation) rather than governing private behavior. |
Primarily applies to the private sector | Boolean | TRUE if the document primarily applies to the private sector or civil society (e.g., conditions on the sale of AI systems, or a law banning individuals from using AI for specific purposes). A document that (implicitly or explicitly) applies to government actors as well as actors outside government would be assigned a value of FALSE for this field. |
Short summary | Text | A skimmable (1-2 sentence length) summary of the AI-related content of the document. |
Long summary | Text | A detailed (1-2 paragraph equivalent) summary of the AI-related content of the document. |
Segment-level tags | Text array | For documents that were segmented for detailed annotation (the typical case), this field lists all thematic tags applicable to one or more of the document's segments. |
Document-level tags | Text array | For documents that were not segmented for detailed annotation (a minority of AGORA documents), this field lists all thematic tags applicable to any part of the document as a whole. |
Summaries and tags include unreviewed machine output | Boolean | TRUE if any summary or tag fields in this table include content generated by a language model and the document has not yet been annotated by a human (i.e., Annotated? is FALSE). |
Official plaintext retrieved | Date | If fulltext for the document is available in fulltext, this field lists the date on which the fulltext was retrieved. |
Official plaintext source | URL | If fulltext for the document is available in fulltext, this field lists the source of the fulltext. |
Official plaintext unavailable/infeasible | Boolean | TRUE if fulltext for the document could not be retrieved in machine-readable form. |
Official pdf source | URL | If a pdf of the document is available in fulltext, this field lists the source of the pdf. |
Official pdf retrieved | Date | If a pdf of the document is available in fulltext, this field lists the date of retrieval. |
Number of segments created | Number | The number of segments into which the document was divided for detailed annotation. If 0, the document was not segmented. |
[subsequent fields] | Boolean | The remaining fields in this table disaggregate the thematic tags listed in Segment-level tags or Document-level tags, as applicable, for ease of analysis. Each field corresponds to one tag and indicates whether the tag does or does not apply with the values TRUE and FALSE, respectively. |
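As a rough illustration of how the per-tag boolean fields might be used, the sketch below counts how many documents carry each tag. It uses only the Python standard library and a tiny inline stand-in for the documents table. The tag column names ("Tag A", "Tag B") are hypothetical placeholders, and two things are assumptions rather than documented facts: that every column after Number of segments created is a per-tag field, and that booleans are serialized as the strings TRUE and FALSE.

```python
import csv
import io
from collections import Counter

# Tiny stand-in for the documents table. "Tag A" and "Tag B" are
# hypothetical placeholders, not real AGORA tag names.
SAMPLE_DOCUMENTS = """\
AGORA ID,Official name,Number of segments created,Tag A,Tag B
1,Example Act,3,TRUE,FALSE
2,Example Order,0,TRUE,TRUE
3,Example Standard,5,FALSE,FALSE
"""

# Last column listed before the per-tag fields in the table above.
LAST_FIXED_COLUMN = "Number of segments created"


def is_true(value):
    """Treat 'TRUE', 'True', or 'true' as true (serialization is an assumption)."""
    return value.strip().lower() == "true"


def count_tags(csv_file):
    """Count how many documents have each per-tag boolean column set."""
    reader = csv.DictReader(csv_file)
    columns = reader.fieldnames or []
    # Assumption: every column after LAST_FIXED_COLUMN is a per-tag boolean.
    tag_columns = columns[columns.index(LAST_FIXED_COLUMN) + 1:]
    counts = Counter({tag: 0 for tag in tag_columns})
    for row in reader:
        for tag in tag_columns:
            if is_true(row[tag]):
                counts[tag] += 1
    return counts


tag_counts = count_tags(io.StringIO(SAMPLE_DOCUMENTS))
print(tag_counts.most_common())
```

To run this against the real table, pass an open file handle for the downloaded documents CSV in place of the `io.StringIO` sample.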
This table includes metadata, summaries, and thematic tags for individual segments of AGORA documents.
Column name | Type | Description |
---|---|---|
Document ID | number | The AGORA ID of the document to which the segment belongs. |
Segment position | number | The position of the segment within the document relative to other segments. |
Text | text | The text of the segment. (Use AGORA text with caution.) |
Tags | text array | A list of thematic tags applicable to the segment, standing alone (i.e., without considering cross-references or the definitions of defined terms that appear in the segment). |
Summary | text | A short summary of the AI-related content of the segment. |
Non-operative | boolean | TRUE if an annotator determined that none of the text of the segment was operative, i.e., having practical effect. Examples of non-operative text include tables of contents and explanatory preambles. |
Not AI-related | boolean | TRUE if an annotator determined that none of the text of the segment related to AI (taking into account the definitions of defined terms and cross-references to other parts of the relevant document). |
Segment annotated | boolean | TRUE if an annotator finished reviewing and populating fields in this table for the segment. |
Segment validated | boolean | TRUE if a validator finished reviewing and resolving issues in fields in this table for the segment. |
Summaries and tags include unreviewed machine output | boolean | TRUE if either Summary or Tags include content generated by a language model and the segment has not yet been annotated by a human (i.e., Segment annotated is FALSE). |
[subsequent fields] | boolean | The remaining fields in this table disaggregate the thematic tags listed in Tags for ease of analysis. Each field corresponds to one tag and indicates whether the tag does or does not apply with the values TRUE and FALSE, respectively. |
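One common use of the segments table is to reassemble a document's segments in reading order. The sketch below groups rows by Document ID and sorts each group by Segment position. It assumes (this is not stated explicitly above) that Document ID holds the AGORA ID of the parent document and that Segment position parses as an integer. The sample rows are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows using a subset of the segments table's columns.
SAMPLE_SEGMENTS = """\
Document ID,Segment position,Text,Summary
7,2,Second section text,Summary of section 2
7,1,First section text,Summary of section 1
9,1,Only section text,Summary of the only section
"""


def segments_by_document(csv_file):
    """Group segment rows by Document ID, ordered by Segment position."""
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[row["Document ID"]].append(row)
    for rows in grouped.values():
        # Assumption: Segment position is an integer stored as text.
        rows.sort(key=lambda r: int(r["Segment position"]))
    return dict(grouped)


grouped_segments = segments_by_document(io.StringIO(SAMPLE_SEGMENTS))
for doc_id, rows in grouped_segments.items():
    print(doc_id, [r["Text"] for r in rows])
```

The dictionary keys are strings because `csv.DictReader` does not convert types. Cast them to `int` if you need to match numeric AGORA IDs loaded elsewhere.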
This table lists and describes the collections to which AGORA documents are assigned.
Column name | Type | Description |
---|---|---|
Name | text | The name of the collection. |
Description | text | A short description of the collection. |
This table lists and describes the authorities associated with AGORA documents.
Column name | Type | Description |
---|---|---|
Name | text | The name of the authority. |
Jurisdiction | text | The legal jurisdiction corresponding to or containing the authority (for example, "United States" for the authority "California"). |
Parent authority | text | If the authority is part of a group of similar authorities, a parent authority group is specified (for example, "State governments" for the authority "California"). |
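Because the Authority field in documents corresponds to a row in authorities, the two tables can be combined to look up each document's jurisdiction. The following is a minimal sketch of that lookup, assuming the match is an exact string comparison between Authority and Name. The document rows are invented, and the "California" authority row reuses the example given in the table above.

```python
import csv
import io

SAMPLE_AUTHORITIES = """\
Name,Jurisdiction,Parent authority
California,United States,State governments
"""

# Invented document rows; the second one deliberately has no matching authority.
SAMPLE_DOCUMENTS = """\
AGORA ID,Official name,Authority
1,Example State Bill,California
2,Example Orphan Document,Unlisted Authority
"""


def jurisdictions_for_documents(documents_file, authorities_file):
    """Return (AGORA ID, jurisdiction) pairs; jurisdiction is None if unmatched."""
    # Assumption: Authority matches authorities.Name exactly, character for character.
    by_name = {row["Name"]: row for row in csv.DictReader(authorities_file)}
    pairs = []
    for doc in csv.DictReader(documents_file):
        authority = by_name.get(doc["Authority"])
        pairs.append((doc["AGORA ID"], authority["Jurisdiction"] if authority else None))
    return pairs


doc_jurisdictions = jurisdictions_for_documents(
    io.StringIO(SAMPLE_DOCUMENTS), io.StringIO(SAMPLE_AUTHORITIES)
)
print(doc_jurisdictions)
```

Returning `None` for unmatched authorities, rather than raising an error, makes it easy to spot rows where the names do not line up.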
The "fulltext" folder includes individual files with the full text of each AGORA document for which full text has been collected. Each file is labeled with the AGORA ID of the corresponding document.
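Since each file is labeled with the AGORA ID of its document, a small helper can locate the file or files for a given ID. The sketch below assumes that the file name without its extension is exactly the AGORA ID. That naming detail and the set of file extensions are assumptions, so the helper matches on the name stem and returns every match rather than guessing one extension.

```python
from pathlib import Path


def find_fulltext_files(fulltext_dir, agora_id):
    """Return files in fulltext_dir whose name (minus extension) equals the ID."""
    target = str(agora_id)
    # Assumption: files are named "<AGORA ID>.<extension>"; extensions may vary.
    return sorted(
        path
        for path in Path(fulltext_dir).iterdir()
        if path.is_file() and path.stem == target
    )
```

For example, `find_fulltext_files("fulltext", 12)` would return any files such as `12.txt` in a local copy of the folder, but not `123.txt`. An empty list means no full text was found for that ID under this naming assumption.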
The AGORA dataset includes laws, regulations, standards, and similar documents that directly and substantively address the development, deployment, or use of artificial intelligence technology. The intent of this scoping definition is to encompass the large majority of documents created by lawmakers, regulators, and standard-setters in direct response to advances in modern machine learning and related technologies.
Applying subjective elements of this definition, such as “directly and substantively,” inevitably involves judgment. When screening documents for inclusion in AGORA, we try to constrain this judgment by defining heuristics.
Critically, the requirement that documents “directly” address artificial intelligence generally excludes laws predating the rise of modern machine learning, even if they are broad enough in scope to bear on AI. We draw this line to ensure that AGORA’s scope is manageable in practice and to reinforce the dataset’s emphasis on policies created in response to 21st century developments in AI, rather than the entire set of policy documents that may affect individual sectors and governance writ large. Note, however, that more recent documents that tailor these broad laws to the specific context of AI would qualify for inclusion in AGORA. For example, while the Civil Rights Act of 1964 would not be included in AGORA, a related federal regulation or guidance document applying the Act to racially discriminatory AI is within AGORA’s scope.
AGORA’s nominal scope is broader than the set of documents collected to date. In particular, the current dataset skews toward U.S. law and policy. New data are added regularly and we plan to broaden coverage over time.
For now, AGORA aims to include the following documents (to the extent they are within AGORA's scope) with a lag of no more than a few months:
Our next priorities are to broaden coverage of U.S. state documents, with the aim of including all enacted, in-scope state laws on a going-forward basis, and to broaden coverage of Chinese central government documents and major corporate commitments. We will update this section as we make progress toward these goals.
Candidate documents for inclusion in AGORA are currently collected manually or using semi-automated means (e.g., saved queries against larger datasets) from a wide range of official and unofficial sources, reflecting the decentralized, largely ad-hoc status of current AI governance tracking. The most common sources include official, general-purpose regulatory compilations, such as the Congress.gov service for United States federal legislation and the Federal Register for United States federal regulation. Human screeners review these sources (periodically, in the case of sources that update) and assess each document against the AGORA scoping definition.
Screeners first determine whether documents are within AGORA's scope. Only in-scope documents are included in the AGORA dataset. For documents determined to be in scope, screeners locate the authoritative text of the document (for example, on the official website of the United States Congress or a state legislature) and use it to populate basic metadata such as title and date of introduction. Whenever feasible, they also archive a plaintext copy of the document and divide it into shorter segments (often, but not always, corresponding to sections, subsections, or similar divisions already present in the text) for granular annotation.
Annotators also identify “packages,” or larger, thematically diverse documents containing AI-related portions amidst other, AI-unrelated material. A typical example is the annual National Defense Authorization Act (NDAA) in the United States, a massive, largely AI-unrelated law with some diverse AI-related provisions sprinkled throughout in recent years. NDAAs and other such packages are divided into conceptually discrete AGORA documents, corresponding to sections, subsections, or other subdivisions in the packages, according to standing guidance (reproduced in the appendices).
Finally, screeners identify a canonical AGORA authority for the document and assign it to one or more thematic collections. AGORA authorities and collections are listed in the authorities and collections tables, respectively.
After screening, a language model processes the full text of each in-scope document and generates draft annotations for human review. We use an "off-the-shelf" commercial LLM (accessed via API) and ETO-developed prompts.
At present, the language model is used only to generate draft summaries of each document as a whole and each segment in the document. We have found that machine-generated summaries of AGORA documents are generally reliable and useful. Regardless, we flag unreviewed machine-generated output in the dataset, and affected documents can easily be removed, if desired, by filtering on the flag fields.
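Filtering on the flag field might look like the sketch below, shown here for the segments table. The column name comes from the table descriptions above. The sample rows are invented, and the assumption that booleans are stored as the strings TRUE and FALSE should be checked against the actual files.

```python
import csv
import io

FLAG = "Summaries and tags include unreviewed machine output"

# Invented sample rows using a subset of the segments table's columns.
SAMPLE_SEGMENTS = f"""\
Document ID,Segment position,Summary,Segment annotated,{FLAG}
7,1,Human-reviewed summary,TRUE,FALSE
7,2,Draft machine summary,FALSE,TRUE
"""


def is_true(value):
    """Treat 'TRUE', 'True', or 'true' as true (serialization is an assumption)."""
    return value.strip().lower() == "true"


def drop_unreviewed_machine_output(csv_file):
    """Keep only rows whose summaries and tags have been reviewed by a human."""
    return [row for row in csv.DictReader(csv_file) if not is_true(row[FLAG])]


reviewed_segments = drop_unreviewed_machine_output(io.StringIO(SAMPLE_SEGMENTS))
print([row["Summary"] for row in reviewed_segments])
```

The same function works on the documents table, which has a flag field of the same name.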
In the future, we may also use language models to generate provisional thematic tags for segments.
Using the basic metadata and authoritative text compiled during the screening process, ETO annotators develop summaries and thematic tags for each in-scope document, or review and revise the machine-generated drafts where available.
For tagging, the annotators use an extensive codebook that includes general guidance and specific definitions for each thematic tag. Tags are applied segment by segment. A custom-built Airtable interface structures the annotator workflow and facilitates quick and accurate annotation.
AGORA’s summaries are meant mainly to help users skim and sift, rather than as an analytic resource in themselves; the codebook provides brief instructions for short- and long-form summaries, but significant discretion is left to annotators.
After initial annotation, a second annotator (designated the “validator”) reviews each document in full and discusses any disagreements with the initial annotator. Note that these are not fixed roles; each AGORA annotator serves as initial annotator on some documents and as validator on others.
Once all issues identified in validation have been resolved, the document’s record, consisting of validated metadata, short and long summaries, and thematic tags, is marked complete.
AGORA includes a thematic taxonomy that is inspired by scholarly and policy literature, but intended to be useful to a wide range of potential users and reasonably intuitive to both those users and AGORA annotators. The taxonomy was drafted by an interdisciplinary team with training in law, data engineering, public policy, political science, AI governance, and quantitative and qualitative social science methods, with input from potential users in government, academia, and the private sector, and has been refined iteratively based on annotator and user feedback.
The taxonomy consists of discrete concepts (“tags”) organized into five domains:
Annotators read each AGORA document in full, then decide whether each of the 77 codes in the AGORA taxonomy applies at any point in the document, based on the definition and (where available) examples and keywords provided in the codebook. In deciding, annotators are instructed to consider only the operative text of each document; to focus on what the document explicitly states or clearly and directly implies; and to ignore material unrelated to artificial intelligence.
Records are periodically added to the public AGORA dataset (and the web interface) using an automated script. We add records as soon as they are screened, then update them as they are annotated and validated. This means that records in the dataset may change over time, and many include unvalidated and/or machine-generated content. If desired, this content can be disregarded by filtering on the relevant fields in the data tables.
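To see how much of a given snapshot has reached each review stage, one option is to tally documents by their Annotated? and Validated? fields. The sketch below is illustrative only: the stage labels ("screened only", "annotated", "validated") are our own shorthand rather than values in the dataset, the sample rows are invented, and the TRUE/FALSE string encoding is an assumption.

```python
import csv
import io
from collections import Counter

# Invented sample rows using a subset of the documents table's columns.
SAMPLE_DOCUMENTS = """\
AGORA ID,Annotated?,Validated?
1,TRUE,TRUE
2,TRUE,FALSE
3,FALSE,FALSE
4,TRUE,TRUE
"""


def is_true(value):
    """Treat 'TRUE', 'True', or 'true' as true (serialization is an assumption)."""
    return value.strip().lower() == "true"


def review_stage(row):
    """Map a documents row to a shorthand review-stage label (labels are ours)."""
    if is_true(row["Validated?"]):
        return "validated"
    if is_true(row["Annotated?"]):
        return "annotated"
    return "screened only"


def count_review_stages(csv_file):
    """Tally documents by the furthest review stage they have reached."""
    return Counter(review_stage(row) for row in csv.DictReader(csv_file))


stage_counts = count_review_stages(io.StringIO(SAMPLE_DOCUMENTS))
print(dict(stage_counts))
```

Keeping only rows where `review_stage(row) == "validated"` gives the most conservative subset of a snapshot.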
10/1/24 | Initial release |