WARCs with datetime before year 1900 cause error in indexer #603

Open
machawk1 opened this Issue Jan 26, 2019 · 3 comments

machawk1 commented Jan 26, 2019

ipwb index /Path/tofb_fab_dates.warc
Traceback (most recent call last):
  File "/Users/machawk1/Library/Python/2.7/bin/ipwb", line 11, in <module>
    sys.exit(main())
  File "/Users/machawk1/Library/Python/2.7/lib/python/site-packages/ipwb/__main__.py", line 18, in main
    args = checkArgs(sys.argv)
  File "/Users/machawk1/Library/Python/2.7/lib/python/site-packages/ipwb/__main__.py", line 165, in checkArgs
    results.func(results)
  File "/Users/machawk1/Library/Python/2.7/lib/python/site-packages/ipwb/__main__.py", line 33, in checkArgs_index
    debug=args.debug)
  File "/Users/machawk1/Library/Python/2.7/lib/python/site-packages/ipwb/indexer.py", line 170, in indexFileAt
    warcFileFullPath, **encryptionAndCompressionSetting)
  File "/Users/machawk1/Library/Python/2.7/lib/python/site-packages/ipwb/indexer.py", line 287, in getCDXJLinesFromFile
    record.rec_headers.get_header('WARC-Date'))
  File "/Users/machawk1/Library/Python/2.7/lib/python/site-packages/ipwb/util.py", line 149, in iso8601ToDigits14
    return d.strftime('%Y%m%d%H%M%S')
ValueError: year=2 is before 1900; the datetime strftime() methods require year >= 1900

fb_fab_dates 2.warc.txt

machawk1 commented Jan 26, 2019

This seems to be a limitation of strftime() in Python 2.7, which rejects years before 1900, with potential solutions provided here.
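One possible workaround, sketched below, is to skip strftime() entirely and build the 14-digit timestamp from the datetime fields with plain string formatting. This is only an illustration: `iso8601_to_digits14` is a hypothetical stand-in for the `iso8601ToDigits14` function named in the traceback, not ipwb's actual code, and it assumes WARC-Date arrives in the `YYYY-MM-DDThh:mm:ssZ` form.

```python
from datetime import datetime


def iso8601_to_digits14(iso_date):
    """Convert an ISO 8601 date string to a 14-digit timestamp.

    Hypothetical stand-in for ipwb's iso8601ToDigits14; assumes the
    'YYYY-MM-DDThh:mm:ssZ' form of the W3C profile of ISO 8601.
    """
    # Parsing is not the failing step in the traceback; only the
    # strftime() call on a year before 1900 raised the ValueError.
    d = datetime.strptime(iso_date, '%Y-%m-%dT%H:%M:%SZ')

    # Format each field by hand instead of calling d.strftime(),
    # zero-padding the year to four digits.
    return '{:04d}{:02d}{:02d}{:02d}{:02d}{:02d}'.format(
        d.year, d.month, d.day, d.hour, d.minute, d.second)


print(iso8601_to_digits14('0002-01-01T00:00:00Z'))
print(iso8601_to_digits14('2019-01-26T12:34:56Z'))
```

Formatting by hand keeps the output identical to `'%Y%m%d%H%M%S'` for modern dates while avoiding the year restriction, at the cost of handling only the one input form assumed here.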

shawnmjones commented Jan 27, 2019

When would a WARC have a datetime prior to the year 1900?

machawk1 commented Jan 27, 2019

@shawnmjones A WARC generated through conventional means should not, since 1900 predates both the WARC spec and the Web. The WARC spec cites the W3C profile of ISO 8601:1988 as the basis for WARC-Date. Dates prior to 1900 are legal under that profile, so they should not cause an exception.

However, a date prior to 1900 in this field is likely the result of a misinterpretation, a misconfiguration, or a fabricated example, such as the one attached above.