parsing multiple Link: response headers #136

Closed
phonedude opened this issue Oct 11, 2018 · 2 comments

@phonedude
Member

commented Oct 11, 2018

Martin brought this to my attention. Here's a sample URL:

https://scholarlyorphans.org/memento/20181009212548/https://ianmilligan.ca/

and it returns two different Link headers:

$ curl -IL https://scholarlyorphans.org/memento/20181009212548/https://ianmilligan.ca/
HTTP/1.1 200 OK
Server: nginx/1.12.1
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
X-Archive-Orig-Server: nginx
Date: Tue, 09 Oct 2018 21:25:48 GMT
X-Archive-Orig-Transfer-Encoding: chunked
X-Archive-Orig-Connection: keep-alive
X-Archive-Orig-Strict-Transport-Security: max-age=86400
X-Archive-Orig-Vary: Accept-Encoding
X-Archive-Orig-Vary: Cookie
X-hacker: If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.
Link: <https://wp.me/4cEB>; rel=shortlink
X-Archive-Orig-Content-Encoding: gzip
X-ac: 3.sea _bur
Memento-Datetime: Tue, 09 Oct 2018 21:25:48 GMT
Link: <https://ianmilligan.ca/>; rel="original", <https://scholarlyorphans.org/memento/https://ianmilligan.ca/>; rel="timegate", <https://scholarlyorphans.org/memento/timemap/link/https://ianmilligan.ca/>; rel="timemap"; type="application/link-format", <https://scholarlyorphans.org/memento/20181009212548/https://ianmilligan.ca/>; rel="memento"; datetime="Tue, 09 Oct 2018 21:25:48 GMT"; collection="memento"
Content-Location: https://scholarlyorphans.org/memento/20181009212548/https://ianmilligan.ca/
Content-Security-Policy: default-src 'unsafe-eval' 'unsafe-inline' 'self' data: blob: mediastream: ws: wss: ; form-action 'self'

Now, the first header is in error (it should be X-Archive-Orig-Link), but multiple Link headers are allowed per RFC 2616 (and RFC 7230). Martin said MementoEmbed wasn't finding the rel="original" link, probably because it occurs in the second Link header and not the first.

@shawnmjones

Collaborator

commented Oct 11, 2018

Thanks for this.

MementoEmbed uses the requests library to read the Link header. requests presents all HTTP headers as a case-insensitive dictionary. If a header appears multiple times, requests combines the values into a single comma-separated string, so MementoEmbed does actually get all of the values for Link.

def get_timegate_from_response(response):
    urig = None
    try:
        urig = aiu.convert_LinkTimeMap_to_dict(
            response.headers['link'] )['timegate_uri']
    except KeyError as e:
        raise NotAMementoError(
            "link header could not be parsed for timegate URI",
            response=response, original_exception=e)
    return urig

The problem lies in the function convert_LinkTimeMap_to_dict, which get_timegate_from_response calls. This function expects every relation to be surrounded by quotes (e.g., rel="timegate" is parseable, but rel=timegate fails). Link header values like <https://wp.me/4cEB>; rel=shortlink are a product of WordPress, which does not quote the argument to rel. Memento entries in the Link header, on the other hand, do quote it (e.g., rel="timegate"). The convert_LinkTimeMap_to_dict function stumbles over the unquoted shortlink relation and never gets to parse the rest of the string.
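To make the ordering concrete, here is a rough standard-library sketch of how two Link fields end up as one comma-separated value with the unquoted shortlink entry first. The helper name combine_headers is hypothetical, and this is not how requests implements it internally; it only mimics the combining rule for repeated header fields.

```python
def combine_headers(raw_lines):
    """Fold repeated header fields into one comma-separated value.

    Hypothetical helper for illustration only. Keys are lower-cased so
    lookups behave case-insensitively.
    """
    combined = {}
    for line in raw_lines:
        # Split at the first colon only; URLs in the value contain colons too.
        name, _, value = line.partition(":")
        key = name.strip().lower()
        value = value.strip()
        if key in combined:
            combined[key] += ", " + value
        else:
            combined[key] = value
    return combined


# Abbreviated version of the response headers from this issue.
raw = [
    'Link: <https://wp.me/4cEB>; rel=shortlink',
    'Memento-Datetime: Tue, 09 Oct 2018 21:25:48 GMT',
    'Link: <https://ianmilligan.ca/>; rel="original", '
    '<https://scholarlyorphans.org/memento/https://ianmilligan.ca/>; rel="timegate"',
]

headers = combine_headers(raw)
print(headers["link"])
```

Because the shortlink entry comes first in the combined string, a parser that gives up on rel=shortlink never reaches the Memento relations that follow it.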

All examples in RFC 8288 - Web Linking use quotes, but section 3 states:

Note that any link-param can be generated with values using either
the token or the quoted-string syntax; therefore, recipients MUST be
able to parse both forms. In other words, the following parameters
are equivalent:

x=y
x="y"

and

Previous definitions of the Link header did not equate the token and
quoted-string forms explicitly; the title parameter was always
quoted, and the hreflang parameter was always a token. Senders
wishing to maximize interoperability will send them in those forms.

So, MementoEmbed needs to support both.
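As a minimal sketch of what accepting both forms could look like, the following parser treats a parameter value as either a quoted-string or a bare token. The names parse_link_header and find_by_rel are hypothetical and are not part of MementoEmbed or aiu; the sketch also ignores parts of RFC 8288 such as escaped quotes and extended (title*) values.

```python
import re

# Split only on commas that begin a new "<uri>" entry, so commas inside
# quoted values such as datetime="Tue, 09 Oct 2018 ..." are left alone.
_ENTRY_SPLIT = re.compile(r",\s*(?=<)")
# A parameter value is either a quoted-string or a bare token.
_PARAM = re.compile(r';\s*([\w*-]+)\s*=\s*(?:"([^"]*)"|([^\s;,"]+))')


def parse_link_header(value):
    """Return a list of (uri, params) tuples from a Link header value."""
    links = []
    for entry in _ENTRY_SPLIT.split(value.strip()):
        match = re.match(r"\s*<([^>]*)>(.*)", entry, re.DOTALL)
        if not match:
            continue  # skip anything that does not start with <uri>
        uri, rest = match.groups()
        params = {}
        for name, quoted, token in _PARAM.findall(rest):
            # Keep the first occurrence of a repeated parameter.
            params.setdefault(name.lower(), quoted or token)
        links.append((uri, params))
    return links


def find_by_rel(links, rel):
    """Return the first URI whose rel parameter contains the given relation."""
    for uri, params in links:
        if rel in params.get("rel", "").split():
            return uri
    return None


header = (
    '<https://wp.me/4cEB>; rel=shortlink, '
    '<https://ianmilligan.ca/>; rel="original", '
    '<https://scholarlyorphans.org/memento/https://ianmilligan.ca/>; rel="timegate", '
    '<https://scholarlyorphans.org/memento/20181009212548/https://ianmilligan.ca/>; '
    'rel="memento"; datetime="Tue, 09 Oct 2018 21:25:48 GMT"'
)
links = parse_link_header(header)
print(find_by_rel(links, "shortlink"))
print(find_by_rel(links, "timegate"))
```

Splitting on a comma followed by "<" rather than on every comma is what keeps the datetime value intact.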

I have found a possible solution. The requests library has its own link format parsing function. When I tested this function a few years ago, it failed miserably on Memento headers, but a test last weekend indicates that it may have matured enough for us to use here.
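One candidate in requests is requests.utils.parse_header_links, which also backs the links property on a response. Whether this is the exact function meant here is an assumption. The sketch below feeds it an abbreviated version of the combined Link value from this issue; the by_rel dictionary is only an illustrative way to index the result.

```python
from requests.utils import parse_header_links

# Abbreviated combined Link value: one unquoted rel, the rest quoted.
header = (
    '<https://wp.me/4cEB>; rel=shortlink, '
    '<https://ianmilligan.ca/>; rel="original", '
    '<https://scholarlyorphans.org/memento/https://ianmilligan.ca/>; rel="timegate", '
    '<https://scholarlyorphans.org/memento/20181009212548/https://ianmilligan.ca/>; '
    'rel="memento"; datetime="Tue, 09 Oct 2018 21:25:48 GMT"'
)

# Each parsed link is a dict with a "url" key plus one key per parameter.
links = parse_header_links(header)

# Index the URLs by relation type for lookup.
by_rel = {link["rel"]: link["url"] for link in links if "rel" in link}
print(by_rel["original"])
print(by_rel["timegate"])
```

This still needs checking against other URI-Ms before relying on it, since archives vary in how they format these headers.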

@shawnmjones shawnmjones added the bug label Oct 11, 2018

@shawnmjones

Collaborator

commented Oct 11, 2018

It looks like the requests library implementation works.

I will have to test with some other URI-Ms.

@shawnmjones shawnmjones self-assigned this Oct 11, 2018

shawnmjones added a commit that referenced this issue Oct 11, 2018

Merge pull request #137 from oduwsdl/link-header-parsing-fix
Fixes #136, link headers should now parse correctly in all cases