Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid format exception in scanJobLog #239

Open
ato opened this Issue Mar 11, 2019 · 4 comments

Comments

Projects
None yet
3 participants
@ato
Copy link
Collaborator

ato commented Mar 11, 2019

@ruebot encountered the following exception after checkpointing and restarting Heritrix:

2019-03-11 14:08:53.599 SEVERE thread-1 org.archive.crawler.framework.Engine.addJobDirectory() bad cxml: /data/heritrix-jobs/academic-calendars/crawler-beans.cxml
java.lang.IllegalArgumentException: Invalid format: "539Z" is malformed at "Z"
    at org.joda.time.format.DateTimeFormatter.parseMillis(DateTimeFormatter.java:634)
    at org.joda.time.convert.StringConverter.getInstantMillis(StringConverter.java:65)
    at org.joda.time.base.BaseDateTime.<init>(BaseDateTime.java:171)
    at org.joda.time.DateTime.<init>(DateTime.java:168)
    at org.archive.crawler.framework.CrawlJob.scanJobLog(CrawlJob.java:179)
    at org.archive.crawler.framework.CrawlJob.<init>(CrawlJob.java:101)
    at org.archive.crawler.framework.Engine.addJobDirectory(Engine.java:153)
    at org.archive.crawler.framework.Engine.findJobConfigs(Engine.java:109)
    at org.archive.crawler.framework.Engine.<init>(Engine.java:72)
    at org.archive.crawler.Heritrix.instanceMain(Heritrix.java:335)
    at org.archive.crawler.Heritrix.main(Heritrix.java:188)

Looks like a bug here:

startPosition = jobLog.length()-(FileUtils.ONE_KB * 100);

If the job log is larger than 100KB, startPosition is set to 100KB from the end which might be in the middle of a line. If that partial line still happens to match Pattern.compile("(\\S+) (\\S+) Job launched") then an incomplete timestamp may be parsed causing the exception.

@anjackson suggests the following fix:

it should read a line and discard it if the startPosition is not zero.

@ato ato added the bug label Mar 11, 2019

anjackson added a commit to ukwa/heritrix3 that referenced this issue Mar 11, 2019

@anjackson

This comment has been minimized.

Copy link
Collaborator

anjackson commented Mar 11, 2019

I have a patch but haven't tested it yet.

@ruebot

This comment has been minimized.

Copy link
Collaborator

ruebot commented Mar 11, 2019

@anjackson happy to test with that jobs.log file on another machine if you need me to.

@anjackson

This comment has been minimized.

Copy link
Collaborator

anjackson commented Mar 12, 2019

@ruebot

This comment has been minimized.

Copy link
Collaborator

ruebot commented Mar 12, 2019

Good to go on my end:

$ tail -f job.log
2019-03-01T12:01:32.130Z INFO Job instantiated
2019-03-01T12:01:42.293Z INFO Job launched
2019-03-01T12:01:43.923Z INFO PREPARING 20190301120143
2019-03-01T12:01:43.955Z INFO PAUSED 20190301120143
2019-03-01T12:01:54.079Z INFO RUNNING 20190301120143
2019-03-01T12:02:01.454Z WARNING nowhere to log added seed: http://calendars.students.yorku.ca/ (in thread 'ToeThread #21: http://calendars.registrar.yorku.ca/'; in processor 'candidates')
2019-03-01T12:02:07.490Z WARNING nowhere to log added seed: https://calendars.students.yorku.ca/ (in thread 'ToeThread #21: http://calendars.students.yorku.ca/'; in processor 'candidates')
2019-03-11T12:26:03.810Z INFO CHECKPOINTED cp00001-20190311122554
2019-03-11T12:26:06.453Z INFO PAUSING 20190301120143
2019-03-11T12:26:07.702Z INFO PAUSED 20190301120143
$ cat heritrix_out.log
Tue Mar 12 08:32:54 EDT 2019 Starting heritrix
Linux wombat 4.15.0-46-generic #49-Ubuntu SMP Wed Feb 6 09:33:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.18.04.1-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
JAVA_OPTS= -Xmx256m
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256586
max locked memory       (kbytes, -l) 16384
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 256586
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
Oracle Corporation OpenJDK Runtime Environment 1.8.0_191-8u191-b12-2ubuntu0.18.04.1-b12

Warning:
The JKS keystore uses a proprietary format. It is recommended to migrate to PKCS12 which is an industry standard format using "keytool -importkeystore -srckeystore adhoc.keystore -destkeystore adhoc.keystore -deststoretype pkcs12".
Using ad-hoc HTTPS certificate with fingerprint...
SHA1:E7:65:F2:C6:5A:90:A0:79:61:FF:28:1F:24:81:A6:74:D0:38:2C:0B
Verify in browser before accepting exception.
2019-03-12 12:32:55.523 INFO thread-1 org.archive.crawler.framework.Engine.addJobDirectory() added crawl job: academic-calendars
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
engine listening at port 9191
operator login set per command-line
NOTE: We recommend a longer, stronger password, especially if your web
interface will be internet-accessible.
Heritrix version: 3.4.0-SNAPSHOT-2019-03-11T21:36:31Z
log4j:WARN No appenders could be found for logger (freemarker.cache).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.