OpenDataPolicyRecommendation
From OpenGovData
This is a work-in-progress. The goal is to take the OpenDataPrinciples formulated at the Sebastopol meeting and turn them into a polished document that can be given to policymakers.
A suggested outline is:
- Introduction to the document
- The Eight Principles in legalese: Language that can be incorporated into policy.
- Annotations to the principles: How they are to be understood. Explanations of technical terms that were needed in the legalese in the previous part.
- Recommendations for implementation: What technologies to use and avoid, how to measure success.
As a start, I am concatenating the contents of OpenDataPrinciples and the subpages below (on 2008-06-06).
Some questions to be answered in the recommendations section:
- Practical suggestions
- How much it will cost
Introduction
The World Wide Web is the public sphere of the modern world. Governments now have the opportunity to better communicate with and understand the needs of their citizens, while citizens may participate more fully in their government through the web. Information is not valuable or useful when it is hoarded or inaccessible; it becomes more valuable as it is shared. Open data promotes broader and better-informed civil discourse, improved public welfare, and a more efficient use of public resources.
This document offers a set of fundamental principles for open government data, where data means electronically stored information, including documents, recordings, and databases. The power of data comes from the ability of computer programs to sort, search, and transform it for a wide range of uses, including improved access for people with disabilities, support for civic education, and new ways to assist government oversight activities within government, the press, and public institutions.
By embracing these principles, governments of the world can become more effective, transparent, and relevant to our lives. The principles follow from discussions documented on the website www.OpenGovData.org.
The principles below do not address what data should be public and open. Privacy, security, and other concerns may legally (and rightly) prevent data sets from being shared with the public. Rather, these principles specify the conditions public data should meet to be considered "open". Anyone adopting this policy recommendation is expected to specify precisely which data are to be made "open".
The American Library Association's "Key Principles of Government Information" outlines a set of principles similar to those contained in this document, such as:
- Use non-proprietary, open data formats that maintain consistency over time
- Maintain privacy
- Keep information license-free
Policy Recommendation
Open government data policy can be enacted in several ways.
- One way is to mandate narrowly that certain data gathered or otherwise held by the government be made "open government data", under the definition below. Such data might be government documents, legislative databases, transcripts and audio/visual recordings of hearings, etc.
- A second way to enact an open data policy would be to make it a widely applicable recommendation to all electronic information resources but require regular independent audits of public government data to assess their compliance with the definition of "open government data".
What follows is the definition of "open government data".
Government data shall be considered open if the data are made public in a way that complies with the principles below:
1. Data Must Be Complete
- All public data are made available. Data are electronically stored information or recordings, including but not limited to documents, databases, transcripts, and audio/visual recordings. Public data are data that are not subject to valid privacy, security or privilege limitations, as governed by other statutes.
2. Data Must Be Primary
- Data are published as collected at the source, with the finest possible level of granularity, not in aggregate or modified forms.
3. Data Must Be Timely
- Data are made available as quickly as necessary to preserve the value of the data.
4. Data Must Be Accessible
- Data are available to the widest range of users for the widest range of purposes.
5. Data Must Be Machine Processable
- Data are reasonably structured to allow automated processing.
6. Access Must Be Non-Discriminatory
- Data are available to anyone, with no requirement of registration.
7. Data Formats Must Be Non-Proprietary
- Data are available in a format over which no entity has exclusive control.
8. Data Must Be License-free
- Data are not subject to any copyright, patent, trademark or trade secret regulation. Reasonable privacy, security and privilege restrictions may be allowed as governed by other statutes.
Finally, compliance must be reviewable.
- A contact person must be designated to respond to people trying to use the data.
- A contact person must be designated to respond to complaints about violations of the principles.
- An administrative or judicial court must have the jurisdiction to review whether the agency has applied these principles appropriately.
Annotations
- 1. Data Must Be Complete
While non-electronic information resources, such as physical artifacts, are not subject to the Open Government Data principles, it is always encouraged that such resources be made available electronically to the extent feasible.
- 2. Data Must Be Primary
Primary data is an important aspect of compliance with the Open Government Data principles. All too often, audio, video, and images are made available to Internet users only at low resolution, making the data impossible to use in any professional application. A resolution that seemed an appropriate "low" setting yesterday begins to look unusable by the standards of today. If an entity chooses to transform data by aggregation or transcoding for use on an Internet site built for end users, it still has an obligation to make the full-resolution information available in bulk for others to build their own sites with and to preserve the data for posterity.
Just as one should not destroy information by presenting and preserving only low-resolution imagery, numeric or tabular data should not be aggressively aggregated for use in one particular Internet application at the cost of throwing away public information that could be used.
The determination of what is an acceptable level of granularity to present and preserve is a moving target and should be based on best practices of the time, with a heavy bias towards "more is better."
- 3. Data Must Be Timely
What counts as timely depends on the nature of the data set. As an example, when the data is a record of ongoing events, is relevant to current policy debate, or is otherwise time sensitive, a delay of more than one month is not acceptable. On the other hand, geographic data collected for purposes independent of any current policy debate, for example, may be released periodically in bulk.
Newly updated complete data sets should be provided in a timely manner as well. Time-sensitive data sets should be updated at the same frequency with which the data changes.
When individual records change, notices of the changes should also be made available in a timely manner.
Despite the foregoing, if data has not been released in a timely manner because of technical constraints, that is not a reason to continue delaying release. Better late than never!
- 4. Data Must Be Accessible
Data must be made available on the Internet so as to accommodate the widest practical range of users and uses. This means considering how choices in data preparation and publication affect access for people with disabilities and how they affect users of a variety of software and hardware platforms. Data must be published with current industry standard protocols and formats, as well as alternative protocols and formats when industry standards impose burdens on wide reuse of the data. This includes honoring accessibility initiatives for people with disabilities.
If the data is accessible from a Web interface, there must be some straightforward means of exporting it (flattening it) to be inspected in raw form directly and imported into other tools.
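As one illustration of what "flattening" might look like in practice, the following minimal Python sketch turns nested records into a single CSV table that can be inspected directly or imported into other tools. The record fields, the dotted column naming, and the function names are assumptions made for the example, not part of the principles.

```python
import csv
import io


def flatten(record, prefix=""):
    """Flatten nested dicts into one level, joining keys with dots."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


def export_csv(records):
    """Return the records as CSV text, one row per record."""
    rows = [flatten(r) for r in records]
    # Collect every column name, in first-seen order.
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical records, as a Web interface might hold them internally.
records = [
    {"id": "hearing-001", "title": "Budget hearing",
     "location": {"city": "Springfield", "room": "101"}},
    {"id": "hearing-002", "title": "Zoning hearing",
     "location": {"city": "Shelbyville"}},
]
csv_text = export_csv(records)
print(csv_text)
```

Missing values are left as empty cells rather than dropped, so every row keeps the same columns.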
- 5. Data Must Be Machine Processable
The ability for data to be widely used requires that the data be properly encoded. Free-form text is not a substitute for, e.g., tabular and normalized records. Images of text are not a substitute for the text itself. Sufficient documentation on the data format and meanings of normalized data items must be available to users of the data.
Following the principle that data must be accessible, the accessibility must extend to automated access. If the data is accessible from some kind of interface, it must be possible to download the complete data set in raw form through an automated process.
- 6. Access Must Be Non-Discriminatory
Anonymous access to the data must be allowed for public data, including access through anonymous proxies. Data should not be hidden behind "walled gardens," accessible only to certain classes of Internet users. To use analogies from earlier periods of the Internet, data only accessible via AOL, Internet 2, or Bloomberg would be considered to be presented in a discriminatory manner. This principle reiterates some of the goals of principle 4, accessibility.
- 7. Data Formats Must Be Non-Proprietary
Proprietary formats add unnecessary restrictions over who can use the data, how it can be used and shared, and whether the data will be usable in the future.
While some proprietary formats are nearly ubiquitous, it is nevertheless not acceptable to use only proprietary formats. Likewise, the relevant non-proprietary formats may not reach a wide audience. In these cases, it may be necessary to make the data available in multiple formats.
Recommendations
- 4. Data Must Be Accessible
The first part of the accessibility principle speaks of availability, meaning the ability for the entirety of the data to be acquired over the Internet. A data set being large does not exempt it from the requirements in this section. Disks are cheap, and high-definition video is no longer hard to produce and distribute. When data sets are too large to be made available in whole, in bulk, directly from the source, assistance from the nonprofit and private sector must be sought. As a last resort, a rotation scheme can be deployed to make available a limited window of data at a time.
Accessibility also relates to uses by disabled individuals. Accessibility initiatives to be followed include the World Wide Web Consortium's Web Accessibility Initiative, and in the United States Section 504 and Section 508 of the federal Rehabilitation Act and Section 255 of the federal Telecommunications Act.
Benchmarks for accessibility include whether existing tools are available to process the data; whether tools that use the data could enable vision-impaired individuals to achieve the same comprehension of the data as a sighted individual, for instance using a Braille workstation or a screen reader; and whether non-English-speaking individuals can use a web service to translate the data (in this case a document) into another language.
- 5. Data Must Be Machine Processable
For tabular or structured data, each record should include an identifier. This identifier should be persistent across revisions to the data set so that external references to individual records can follow updates. The identifier can be a globally unique URI following Semantic Web best practices, for instance. The data format should be documented so that those familiar with the domain of the data set can understand it. All columns, tags, and abbreviations should be described. However, an XML schema or the like is not necessary.
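To illustrate why persistent identifiers matter, the following minimal Python sketch compares two revisions of a data set keyed by identifier and reports which records were added, removed, or changed. The URIs, field names, and the function `diff_revisions` are hypothetical and chosen only for the example.

```python
def diff_revisions(old, new):
    """Compare two revisions keyed by persistent identifier.

    Returns (added, removed, changed) as sorted lists of identifiers.
    """
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return added, removed, changed


# Hypothetical revisions of a legislative data set, keyed by URI.
rev1 = {
    "https://data.example.org/bill/100": {"title": "Water Act", "status": "introduced"},
    "https://data.example.org/bill/101": {"title": "Road Act", "status": "introduced"},
}
rev2 = {
    "https://data.example.org/bill/100": {"title": "Water Act", "status": "passed"},
    "https://data.example.org/bill/102": {"title": "Parks Act", "status": "introduced"},
}

added, removed, changed = diff_revisions(rev1, rev2)
print("added:", added)
print("removed:", removed)
print("changed:", changed)
```

Because the identifier for bill 100 is the same in both revisions, an external site that links to it can detect the status change and update itself. Without stable identifiers, the same comparison would require guessing which records correspond.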
A benchmark for meeting this requirement is whether a programmer can build a parser for the data in a scripting language in just an afternoon. That parser should be able to crawl through the published dataset and push the data into a database.
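As a rough sense of scale for this benchmark, the following Python sketch parses a small CSV sample and loads it into an SQLite database using only the standard library. The column names, the table name `votes`, and the inline sample are assumptions for illustration; a real parser would read the published files instead of an embedded string.

```python
import csv
import io
import sqlite3

# Hypothetical sample of a published data file.
SAMPLE = """id,date,title
vote-1,2008-06-01,Motion to adjourn
vote-2,2008-06-02,Budget amendment
"""


def load_into_db(text, conn):
    """Parse CSV text and insert each record into the votes table."""
    reader = csv.DictReader(io.StringIO(text))
    conn.execute(
        "CREATE TABLE IF NOT EXISTS votes (id TEXT PRIMARY KEY, date TEXT, title TEXT)"
    )
    rows = [(r["id"], r["date"], r["title"]) for r in reader]
    # INSERT OR REPLACE lets a re-run on an updated file refresh
    # existing records by their identifier.
    conn.executemany(
        "INSERT OR REPLACE INTO votes (id, date, title) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
count = load_into_db(SAMPLE, conn)
titles = [row[0] for row in conn.execute("SELECT title FROM votes ORDER BY id")]
print(count, titles)
```

If a data set needs substantially more than this (custom binary decoding, screen scraping, or guessing at undocumented fields), that is a sign it may fall short of the benchmark.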
There should be a means of notifying users of the data of changes in the data format. A mailing list or RSS feed aimed at data users, plus a document describing the history of the data format, is recommended.
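One possible shape for such a feed is sketched below in Python, building a minimal RSS 2.0 document with the standard library. The feed title, link, and change entry are hypothetical, and an agency would normally generate the entries from its own format history.

```python
import xml.etree.ElementTree as ET


def build_changelog_feed(title, link, changes):
    """Build a minimal RSS 2.0 feed listing data format changes."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "Changes to the data format"
    for change in changes:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = change["title"]
        ET.SubElement(item, "description").text = change["description"]
        ET.SubElement(item, "pubDate").text = change["date"]
    return ET.tostring(rss, encoding="unicode")


# Hypothetical change entry for a hypothetical data set.
feed_xml = build_changelog_feed(
    "Example data set: format changes",
    "https://data.example.org/format-history",
    [
        {
            "title": "Added 'status' column",
            "description": "Each record now carries a status field.",
            "date": "Fri, 06 Jun 2008 00:00:00 GMT",
        }
    ],
)
print(feed_xml)
```

The feed's link can point at the document describing the history of the data format, so that both recommended mechanisms stay in step.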
- 7. Data Formats Must Be Non-Proprietary
- 8. Data Must Be License-free
Benchmarks for meeting these requirements include whether the data can be used in applications based entirely on free software (that is, software free of license and patent restrictions), and whether individuals are able to redistribute the data without restriction and without requiring the permission of any third party (including the government).
