parsing multiple Link: response headers #136

phonedude · Oct 11, 2018

Martin brought this to my attention. Here's a sample URL:

https://scholarlyorphans.org/memento/20181009212548/https://ianmilligan.ca/

and it returns two different Link headers:

$ curl -IL https://scholarlyorphans.org/memento/20181009212548/https://ianmilligan.ca/
HTTP/1.1 200 OK
Server: nginx/1.12.1
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
X-Archive-Orig-Server: nginx
Date: Tue, 09 Oct 2018 21:25:48 GMT
X-Archive-Orig-Transfer-Encoding: chunked
X-Archive-Orig-Connection: keep-alive
X-Archive-Orig-Strict-Transport-Security: max-age=86400
X-Archive-Orig-Vary: Accept-Encoding
X-Archive-Orig-Vary: Cookie
X-hacker: If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.
Link: <https://wp.me/4cEB>; rel=shortlink
X-Archive-Orig-Content-Encoding: gzip
X-ac: 3.sea _bur
Memento-Datetime: Tue, 09 Oct 2018 21:25:48 GMT
Link: <https://ianmilligan.ca/>; rel="original", <https://scholarlyorphans.org/memento/https://ianmilligan.ca/>; rel="timegate", <https://scholarlyorphans.org/memento/timemap/link/https://ianmilligan.ca/>; rel="timemap"; type="application/link-format", <https://scholarlyorphans.org/memento/20181009212548/https://ianmilligan.ca/>; rel="memento"; datetime="Tue, 09 Oct 2018 21:25:48 GMT"; collection="memento"
Content-Location: https://scholarlyorphans.org/memento/20181009212548/https://ianmilligan.ca/
Content-Security-Policy: default-src 'unsafe-eval' 'unsafe-inline' 'self' data: blob: mediastream: ws: wss: ; form-action 'self'

Now, the first header is in error (it should be in X-Archive-Orig-Link), but multiple Link headers are allowed as per RFC 2616 (and RFC 7230). Martin said MementoEmbed wasn't finding link rel="original", probably because it occurs in the second Link header and not the first.

shawnmjones · Oct 11, 2018

Thanks for this.

MementoEmbed uses the requests library to find the values for the Link header. requests presents all HTTP headers as a case-insensitive dictionary. If a header is specified multiple times, requests is smart enough to combine the values together, so MementoEmbed does actually get all of the values for Link.
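
To illustrate the combining behaviour described above, here is a minimal sketch using only the Python standard library. It does not call requests itself; the helper name `combined_header` and the two-header sample are assumptions made for illustration. It joins repeated header fields with a comma, which is the equivalence RFC 7230 allows for list-valued headers such as Link.

```python
import http.client
import io

# Two separate Link header fields, shortened from the response in this issue.
RAW_HEADERS = (
    b"Link: <https://wp.me/4cEB>; rel=shortlink\r\n"
    b'Link: <https://ianmilligan.ca/>; rel="original"\r\n'
    b"\r\n"
)


def combined_header(raw, name):
    """Join every occurrence of one header field with ', ' (hypothetical helper)."""
    message = http.client.parse_headers(io.BytesIO(raw))
    # get_all() is case-insensitive and returns None when the field is absent.
    values = message.get_all(name) or []
    return ", ".join(values)


combined = combined_header(RAW_HEADERS, "link")
print(combined)
```

With the values joined like this, rel="original" is present in the string handed to the parser even though it arrived in the second header field.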

MementoEmbed/mementoembed/mementoresource.py

Lines 103 to 115 in 1246756

    
           def get_timegate_from_response(response): 
        
               urig = None 
        
               try: 
        
                   urig = aiu.convert_LinkTimeMap_to_dict( 
        
                       response.headers['link'] )['timegate_uri'] 
        
               except KeyError as e: 
        
                   raise NotAMementoError( 
        
                       "link header coult not be parsed for timegate URI", 
        
                       response=response, original_exception=e) 
        
               return urig

The problem exists in the function convert_LinkTimeMap_to_dict seen on lines 108-109 above. This function expects all relations to be surrounded by quotes (e.g., rel="timegate" is parseable, but rel=timegate fails). Link header values like <https://wp.me/4cEB>; rel=shortlink are a product of WordPress. They do not surround the argument to rel in quotes. Memento entries in the Link header, on the other hand, do surround the argument to rel in quotes (e.g., rel="timegate"). The convert_LinkTimeMap_to_dict function is stumbling over that shortlink relation because it has no quotes and it never gets to parse the rest of the string.

All examples in RFC 8288 - Web Linking use quotes, but section 3 states:

Note that any link-param can be generated with values using either
the token or the quoted-string syntax; therefore, recipients MUST be
able to parse both forms. In other words, the following parameters
are equivalent:

x=y
x="y"

and

Previous definitions of the Link header did not equate the token and
quoted-string forms explicitly; the title parameter was always
quoted, and the hreflang parameter was always a token. Senders
wishing to maximize interoperability will send them in those forms.

So, MementoEmbed needs to support both.
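
As a rough sketch of what accepting both forms could look like, the following standard-library parser treats `rel=shortlink` and `rel="timegate"` the same way. It is not the code in MementoEmbed, aiu, or requests; the names `parse_link_header` and `find_rel` are hypothetical, and the regular expressions cover only the simple cases seen in this issue (no backslash escapes inside quoted strings, no `title*` encoding).

```python
import re

# One link-value: <URI> followed by zero or more ;name=value parameters.
_LINK = re.compile(
    r'<([^>]*)>((?:\s*;\s*[^\s;,="]+(?:\s*=\s*(?:"[^"]*"|[^\s;,"]+))?)*)'
)
# One parameter: a name, then either a quoted-string or a bare token value.
_PARAM = re.compile(r';\s*([^\s;,="]+)(?:\s*=\s*(?:"([^"]*)"|([^\s;,"]+)))?')


def parse_link_header(value):
    """Return a list of dicts, one per link, with 'uri' plus its parameters."""
    links = []
    for uri, params in _LINK.findall(value):
        link = {"uri": uri}
        for name, quoted, token in _PARAM.findall(params):
            # Exactly one of quoted/token is non-empty for a valued parameter.
            link[name.lower()] = quoted or token
        links.append(link)
    return links


def find_rel(links, rel):
    """Return the URI of the first link whose rel list contains `rel`."""
    for link in links:
        if rel in link.get("rel", "").split():
            return link["uri"]
    return None


header = (
    '<https://wp.me/4cEB>; rel=shortlink, '
    '<https://ianmilligan.ca/>; rel="original", '
    '<https://scholarlyorphans.org/memento/https://ianmilligan.ca/>; rel="timegate", '
    '<https://scholarlyorphans.org/memento/20181009212548/https://ianmilligan.ca/>; '
    'rel="memento"; datetime="Tue, 09 Oct 2018 21:25:48 GMT"'
)
links = parse_link_header(header)
print(find_rel(links, "timegate"))
```

Matching whole link-values with a regular expression, rather than splitting the header on commas, keeps the comma inside the quoted datetime value from being mistaken for a separator between links.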

I have discovered a possible solution. The requests library has its own link format parsing function. When I tested this function a few years ago, it failed miserably on parsing Memento headers, but a recent test last weekend indicates that it may have matured enough for us to use here.

shawnmjones · Oct 11, 2018

It looks like the requests library implementation works.

I will have to test with some other URI-Ms.

shawnmjones added the bug label Oct 11, 2018

shawnmjones added a commit that referenced this issue Oct 11, 2018

fixes for link header problems identified in #136

190c8c0

shawnmjones self-assigned this Oct 11, 2018

shawnmjones closed this in 72ed625 Oct 11, 2018

oduwsdl/MementoEmbed
