Proposal for changes to spec #20

Closed
wants to merge 12 commits into from

@diamondap

diamondap commented Sep 12, 2019

Proposed BagIt Profile Changes for v2.0

This document proposes some breaking changes to the BagIt Profiles specification.

These changes evolved during the development of APTrust's DART tool, which (among other things) creates and validates bags. While developing DART, we found that the existing BagIt profile spec could not describe bags we needed to create and validate.

The current BagIt Profiles spec only defines tags within the bag-info.txt file. It cannot describe valid APTrust bags because APTrust requires a tag file called aptrust-info.txt, which must contain a specific set of tags, each of which has defined constraints. Other organizations in the past, such as DPN, also required a defined set of tags outside of bag-info.txt, and future organizations will likely do the same.

APTrust proposes changes to BagIt Profiles to achieve the following goals:

  1. Allow profiles to describe tags outside the bag-info.txt file.

  2. Simplify the work of software that creates and validates bags by collecting all tag requirements in a single list within the profile.

  3. Allow profiles to tell users (optionally through the GUIs of bagging software) what information is expected in tag values.

  4. Allow profiles to specify a set of valid manifests and tag manifests without prescribing which manifest algorithms must be used.

  5. Allow profiles to specify whether serialized bags must expand into a directory that matches the name of the serialized bag.

Limitations of Tag Definitions in BagIt-Profiles v1.2.0

BagIt-Profiles v1.2.0 allows users to specify which tags should be present in the bag-info.txt file, whether they're required, and which values are allowed. For example:

   "Bag-Info":{
      "Source-Organization":{
         "required":true,
         "values":[
            "Simon Fraser University",
            "York University"
         ]
      },
      "Contact-Name":{
         "required":true,
         "values":[
            "Mark Jordan",
            "Nick Ruest"
         ]
      },
      "Contact-Phone":{
         "required":false
      }
   }

v1.2.0 also allows users to specify which tag files are required, like so:

   "Tag-Files-Required":[
     "DPN/dpnFirstNode.txt",
     "DPN/dpnRegistry"
   ]

The spec does not allow users to specify which specific tags are required in additional files like aptrust-info.txt and dpn/dpn-tags.txt.

It may be possible to add these definitions by defining additional fields in the profile, like so:

   "APTrust-Info":{
      "Access":{
         "required":true,
         "values":[
            "Institution",
            "Consortia",
            "Restricted"
         ]
      },
      "Title":{
         "required":true
      }
   }

This approach presents two problems.

  1. First, in order to collect a list of all tag definitions, the software that creates or validates the bag must scan through all the keys of the BagIt profile JSON structure and look for keys that don't match known names like Manifests-Required, Allow-Fetch.txt, etc. It must then assume that these keys describe tag files and that their values describe the tags expected in those files.

  2. Second, the key names in the JSON structure don't necessarily match the tag file names. Bag-Info.txt and bag-info.txt are not the same thing on case-sensitive file systems. Asking the bagging/validation software to infer file names based on loosely matching key names invites trouble.
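To make the first problem concrete, here is a minimal Python sketch of the guessing a validator would be forced to do. The function name and the abbreviated set of known keys are assumptions made for illustration; they are not part of the spec or of any existing tool.

```python
# Hypothetical sketch: inferring extra tag-file definitions from unrecognized keys.
# KNOWN_KEYS is an abbreviated, illustrative subset of v1.2.0 profile keys.
KNOWN_KEYS = {
    "BagIt-Profile-Info",
    "Bag-Info",
    "Manifests-Required",
    "Tag-Manifests-Required",
    "Tag-Files-Required",
    "Allow-Fetch.txt",
    "Serialization",
    "Accept-Serialization",
    "Accept-BagIt-Version",
}


def guess_tag_definitions(profile):
    """Return {profile_key: value} for every key that is not a known
    profile setting and whose value is a JSON object (a dict)."""
    guesses = {}
    for key, value in profile.items():
        if key in KNOWN_KEYS:
            continue  # a documented profile setting, not a guess
        if isinstance(value, dict):
            # Assumption: any unknown object describes tags in some tag file.
            guesses[key] = value
    return guesses


profile = {
    "Manifests-Required": ["md5"],
    "Bag-Info": {"Source-Organization": {"required": True}},
    "APTrust-Info": {"Title": {"required": True}},
}
guesses = guess_tag_definitions(profile)
```

Even when the guess succeeds, the validator only learns the key `APTrust-Info`. It still has to decide whether that means aptrust-info.txt, APTrust-Info.txt, or something else, which is the second problem.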

Proposed Fix for Tag Definitions

A single key in a BagIt profile called Tags can contain a list of all required tag files as well as definitions of the tags expected to be in them. For example:

    "Tags":[
        {
            "tagFile": "bag-info.txt",
            "tagName": "Source-Organization",
            "required": true,
            "help": "The name of the organization that produced this bag, or is responsible for its contents."
        },
        {
            "tagFile": "aptrust-info.txt",
            "tagName": "Title",
            "required": true,
            "help": "The title or name that describes this bag's contents."
        },
        {
            "tagFile": "aptrust-info.txt",
            "tagName": "Access",
            "required": true,
            "values": [
                "Consortia",
                "Institution",
                "Restricted"
            ],
            "help": "Access rights for this bag describe who can see that it exists in the repository."
        }
    ]

Bagging software and bag validators can scan a single list in the profile definition to find all required tags in all required tag files. If a tag file contains at least one required tag, the bagger/validator can assume the containing tag file is also required. In that case, Tag-Files-Required would no longer be needed.
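As a rough illustration of that single scan, the Python sketch below groups required tag names by tag file. The attribute names (tagFile, tagName, required) come from the proposed Tags list; the function name is hypothetical, and treating a missing "required" attribute as false is an assumption.

```python
from collections import defaultdict


def required_tags_by_file(tags):
    """Group the names of required tags by the tag file that holds them."""
    required = defaultdict(list)
    for tag in tags:
        # Assumption: a tag without a "required" attribute is optional.
        if tag.get("required", False):
            required[tag["tagFile"]].append(tag["tagName"])
    return dict(required)


tags = [
    {"tagFile": "bag-info.txt", "tagName": "Source-Organization", "required": True},
    {"tagFile": "aptrust-info.txt", "tagName": "Title", "required": True},
    {
        "tagFile": "aptrust-info.txt",
        "tagName": "Access",
        "required": True,
        "values": ["Consortia", "Institution", "Restricted"],
    },
]

by_file = required_tags_by_file(tags)
# Any file with at least one required tag is itself treated as required.
required_files = sorted(by_file)
```

Here the list of required tag files falls out of the same pass, which is why a separate Tag-Files-Required entry becomes redundant for key:value tag files.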

A Note on the Help Attribute

APTrust uses the help attribute of each tag definition to provide tooltips in its graphical bagging library. These tips help users understand what information is expected in a tag field.

Members of the Beyond the Repository (BTR) project will soon be publishing a BagIt profile intended to be supported by all distributed digital preservation repositories (DDPs) in the US. They also want a new attribute similar to help in the tag definitions, though they are calling it description. The name isn't as important as the presence of some inline documentation to help bag creators supply meaningful tag values.

Manifests-Allowed and Tag-Manifests-Allowed

BagIt profiles v1.2.0 defines Manifests-Required and Tag-Manifests-Required.

   "Manifests-Required":[
      "md5"
   ],
   "Tag-Manifests-Required": [
      "md5"
   ]

However, APTrust currently supports, and BTR plans to support, Manifests-Allowed and Tag-Manifests-Allowed. In both cases, this is for the practical purpose of making it as easy as possible for depositors to push content to DDPs.

APTrust found that some of its depositors' internal workflows already produced bags with md5 manifests, while others produced bags with sha256 manifests. To avoid making depositors redefine their internal workflows, APTrust started accepting bags with either md5 or sha256 manifests.

Our profile definition looked like this:

   "Manifests-Required":[],
   "Tag-Manifests-Required": [],
   "Manifests-Allowed":[
      "md5",
      "sha256"
   ],
   "Tag-Manifests-Allowed": [
      "md5",
      "sha256"
   ]

Since the BagIt specification says a bag must have a payload manifest, APTrust's validator does require a manifest, and will accept any one from the list.
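The rule described above could be sketched in Python as follows. This is not APTrust's actual validator code: the function name and error messages are invented for illustration, and treating an empty allowed list as "no restriction" is an assumption.

```python
def check_manifest_algorithms(found, required, allowed):
    """Return a list of error messages; an empty list means the bag passes.

    found    -- algorithms of the payload manifests present in the bag
    required -- the profile's Manifests-Required list
    allowed  -- the profile's Manifests-Allowed list
    """
    errors = []
    if not found:
        # The BagIt specification requires at least one payload manifest.
        errors.append("bag has no payload manifest")
    for alg in required:
        if alg not in found:
            errors.append(f"required manifest missing: {alg}")
    if allowed:  # assumption: an empty list places no restriction
        for alg in found:
            if alg not in allowed:
                errors.append(f"manifest algorithm not allowed: {alg}")
    return errors


allowed = ["md5", "sha256"]
ok = check_manifest_algorithms(["sha256"], [], allowed)
rejected = check_manifest_algorithms(["sha1"], [], allowed)
```

The same function would serve for tag manifests by passing in the Tag-Manifests-Required and Tag-Manifests-Allowed lists, minus the "at least one" check if tag manifests are optional.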

BTR plans a similar definition, in order to make the process of moving data from local repos into DDPs as simple as possible. Their definition will likely support all commonly-implemented digest algorithms, like this:

   "Manifests-Allowed":[
        "md5",
        "sha1",
        "sha224",
        "sha256",
        "sha384",
        "sha512"
   ],
   "Tag-Manifests-Allowed": [
        "md5",
        "sha1",
        "sha224",
        "sha256",
        "sha384",
        "sha512"
   ]

Because APTrust, Chronopolis, DuraCloud, LOCKSS, and MetaArchive are all participating in the BTR grant, all of them may at some point support the BTR approach of specifying allowed manifest algorithms without prescribing which one must be used.

Deserialization-Match-Required

Finally, APTrust has one request related to validating serialized bags. We currently enforce a recommendation that was part of version 14 of the BagIt spec but was later dropped. Section 4.2 of the old spec said:

The serialization SHOULD have the same name as the bag's base directory...

APTrust has always enforced a rule that these names MUST match. That is, if a tarred bag file is called photos.tar, it must untar to a single directory called photos.

APTrust and other DDPs typically untar bags in a staging area during the ingest process. When the bag photos.tar untars to a directory called photos, we can be sure its contents will not overwrite or commingle with the contents of other bags being processed at the same time.

Allowing bags to deserialize to arbitrary locations can cause problems. For example, if photos.tar, audio.tar, and video.tar all expand to a directory called bag_contents and are all being ingested at the same time, contents of one bag can be mistaken for contents of another bag, and we wind up with a mess.

To prevent this, APTrust looks into serialized bags BEFORE deserializing them to ensure that they will expand into a directory with the same name. We reject bags that don't meet this rule.
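A pre-extraction check of this kind might look like the Python sketch below, which uses the standard tarfile module to list member names without unpacking anything. It is an illustration only, not APTrust's implementation: the function names are hypothetical, and it handles plain .tar names only (photos.tar.gz would need extra suffix handling).

```python
import os
import tarfile


def expected_top_dir(tar_path):
    """photos.tar -> photos (strips only the final extension)."""
    return os.path.splitext(os.path.basename(tar_path))[0]


def top_level_names(member_names):
    """Collect the first path component of each archive member name."""
    tops = set()
    for name in member_names:
        # Ignore empty components and a leading "./" prefix.
        parts = [p for p in name.split("/") if p not in ("", ".")]
        if parts:
            tops.add(parts[0])
    return tops


def untars_to_matching_dir(tar_path):
    """True if every member sits under one directory named after the tar file."""
    with tarfile.open(tar_path) as tar:
        names = tar.getnames()  # reads the member list; extracts nothing
    return top_level_names(names) == {expected_top_dir(tar_path)}
```

A bag whose members all begin with photos/ passes for photos.tar, while one that expands to bag_contents/ is rejected before anything is written to the staging area.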

Although the recommendation in Section 4.2 was dropped from the official BagIt spec, we would like BagIt profiles to provide a way to specify whether a valid bag must deserialize to an expected directory. This rule would only apply to serialized bags, and can default to false.

It has practical applications for DDPs and can vastly simplify the ingest process and the maintenance of the DDPs' staging area. APTrust is not the only DDP to use a staging area for bag validation. Chronopolis, Texas Digital Library, and Hathi Trust also used staging areas when they acted as DPN nodes, and DDPs will likely continue to use them in the future.

BTR and APTrust Change Requests

The BTR team will be submitting its comments and change requests separately in the coming weeks. Nothing in the BTR requests contradicts anything in the APTrust requests. The only difference so far is the name of the help/description attribute, and APTrust is flexible on that.

Included Sample Profiles

The sample profiles bagProfileBar.json and bagProfileFoo.json have been revised from the 1.2.0 spec to use the format of the proposed 2.0 spec.

Member

ruebot left a comment

The Proposed-v2.0-Changes.md file needs to be removed from the PR, as does the img binary. The text of Proposed-v2.0-Changes.md should be the text of the initial PR comment here. This outlines your cases, and should be the foundation of the conversation there.

@ruebot ruebot dismissed their stale review Sep 12, 2019

Updated

@jscancella

jscancella commented Sep 19, 2019

I'm one of the authors of the BagIt specification, for what that's worth, and I would like to comment on these proposals. I really love the idea of bagit-profiles and implemented it in the latest bagit-java from the Library of Congress when I was employed there.

Limitations of Tag Definitions in BagIt-Profiles v1.2.0

This bothers me for two reasons:

  1. Breaking changes should be breaking for a reason and I don't think a good enough reason has been supplied. I would much rather see a proposal that adds optional elements instead of changing elements that already exist. If in the future it is found that no one uses the old element, then propose removing it and causing a breaking change.
  2. This is assuming that others will want to have the exact same format of key:value tag files. I haven't heard a good argument on why you need this outside bag-info.txt which was created for information like this.

help tag

I am always for good documentation (which is really hard), so this gets a big thumbs up from me.

Manifests-Allowed and Tag-Manifests-Allowed

I really like this, since it allows for more flexibility in acceptable manifests. I wonder if there might also be a way to specify an ordering of preferred manifest algorithms? Something like [sha128, sha512, sha3] implying that the minimum accepted algorithm is sha128 but we will also accept sha512 or sha3.

Deserialization-Match-Required

The reason we took out the serialization was that we realized it is orthogonal to the purpose of Bagit which is to ensure all the files were delivered and that they haven't changed. There was also the thorny issue of ZIP and others behaving differently on different OSes. This particular use case seems too specific to the way you are ingesting bags in my opinion. You could easily solve this by just creating a temp folder with the name you are expecting and not have to deal with clobbering anything.

All these opinions are my own and do not reflect those of my employer. I also do NOT work at the Library of Congress anymore, but I hope they are also looking at this and will comment.

@diamondap

Author

diamondap commented Oct 3, 2019

Thanks for your well-explained response. I agree we can drop Deserialization-Match-Required. The bagit spec only says the serialized version should deserialize to a single directory, so we can leave it at that.

Adding an optional Tags attribute, as opposed to replacing "Bag-Info.txt" with "Tags" will work for our needs, though I hope it wouldn't impose an undue burden on existing BagIt tools. Some BagIt profiles, including APTrust and DPN, have used the key:value format in specially-named tag files like aptrust-info.txt and dpn-info.txt. The idea was that those organizations would stick to the key:value format that most bagging tools could already read and write in order to pass along information that was essential to the ingest process. I don't know why they decided to put required tags in a file outside of bag-info.txt. That decision was made before I joined the projects. But I suspect others may be doing this as well.

You'll likely be getting a pull request soon from the Beyond the Repository grant team asking for the help tag, Manifests-Allowed and Tag-Manifests-Allowed. We can wait to see how closely their request lines up with APTrust's request.

@jamiepb

jamiepb commented Oct 3, 2019

I am also having trouble understanding why the tags must be in their own file rather than in the bag-info.txt file. Perhaps it's an easy way to distinguish which metadata set the bag uses, but that information could be included in the bag-info file. The State Archives of North Carolina developed our own profiles that we now use for state agency and local government records, and unless there was backward compatibility and we could continue to validate our existing bags, a breaking change would cause problems for us.

The help attribute sounds like a good idea, and the tag-manifests allowed sounds interesting.

@diamondap

Author

diamondap commented Oct 3, 2019

I'm not sure why APTrust chose to put tags in an aptrust-info.txt file rather than in bag-info.txt. That decision was made before I came on board.

DPN chose to put DPN-specific tags in dpn-info.txt because depositors would send bags into one preservation repository (APTrust, TDL, Hathi Trust, or Chronopolis), and that repo would then replicate the bag to other repositories, with DPN-specific info in the dpn-info.txt file. The idea was to not pollute bag-info.txt with info that didn't come from the original depositor.

It sounds like this is non-standard practice. If that's the case, then there is indeed no need for the proposed Tags list, and APTrust should instead consider updating its BagIt profile to conform to standards.

I'd still like to hear if others out there require tag files outside of bag-info.txt, and if so, how they validate the contents of the tag files.

@diamondap diamondap mentioned this pull request Nov 6, 2019
@jwestgard

jwestgard commented Nov 16, 2019

I would like to register my support for nos. 2 and 3 above. Regarding no. 1, it sounds like perhaps moving APTrust tags into bag-info.txt might be an acceptable solution that avoids imposing a breaking change on all bagit users, and if so I would support that approach. I don't have strong feelings about no. 4 (deserialization) other than to say I support the motivation of the change. It sounds, however, like processing tools deserializing into a temp folder could be an acceptable alternative solution to avoid unintentional clobbering.

@ruebot

Member

ruebot commented Nov 21, 2019

@diamondap there appears to be a consensus against the breaking changes. Would you be amenable to updating this pull request to only include the non-breaking changes? Then we can get some more feedback.

@diamondap

Author

diamondap commented Nov 21, 2019

@ruebot I'm willing to remove the breaking changes. Doing so will make this pull request almost identical to PR #21. The only difference is that the new proposed attribute called "help" in this pull request is called "description" in PR #21, which is fine with me.

@jscancella, @jwestgard, and @jamiepb should note that PR #21 contains the parts of this pull request they liked, without any of the breaking changes they objected to. Please go ahead and reject this request. We'll track the progress of PR #21.

@ruebot

Member

ruebot commented Nov 21, 2019

@diamondap cool. Once that gets in, feel free to add any additional PRs. I'll have a ReSpec version up shortly after that one gets in (very early December probably), which should hopefully be a bit easier to work with.

All that said, really happy to see the interest in the spec, and really happy to see community work around it! 😃 😃 😃

@ruebot ruebot closed this Nov 21, 2019