Problems with pushing mementos into Internet Archive #43

Open
shawnmjones opened this issue Apr 1, 2020 · 3 comments

@shawnmjones shawnmjones commented Apr 1, 2020

I noticed this when I was using ArchiveNow this morning.

# archivenow www.foxnews.com
Error (The Internet Archive): 445 Client Error:  for url: https://web.archive.org/save/www.foxnews.com

If I add a User-Agent header to the requests.get call on line 15 of archivenow/archivenow/handlers/ia_handler.py, the request succeeds. The current call is:

r = requests.get(uri, timeout=120, allow_redirects=True)
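A minimal sketch of the workaround, assuming the fix is simply to pass a `headers` dictionary alongside the existing arguments. The helper name `build_request_kwargs` and the default user-agent string are hypothetical and are not part of ArchiveNow; the helper only builds the keyword arguments, so it runs without network access.

```python
# Hypothetical default; any descriptive user-agent string could be used here.
DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; archivenow-sketch)"


def build_request_kwargs(headers=None, timeout=120):
    """Return keyword arguments for requests.get with a User-Agent header set.

    Caller-supplied headers override the default User-Agent.
    """
    merged = {"User-Agent": DEFAULT_USER_AGENT}
    if headers:
        merged.update(headers)
    return {"timeout": timeout, "allow_redirects": True, "headers": merged}


# Intended use inside the handler (not executed here):
#   r = requests.get(uri, **build_request_kwargs())
```

Keeping the header merge in one place would let a caller override the user agent without touching the timeout or redirect settings.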

I'm not sure how you want to let users specify their own user agent. The existing --agent argument appears to select the tool used to create WARCs, and there doesn't appear to be a way to pass request headers through archivenow/archivenow.py.

Because I call ArchiveNow from Python code, I would prefer a parameter on the push function defined on line 129 of archivenow/archivenow.py.

def push(URI, arc_id, p_args={}):
    global handlers
    global res_uris
    try:
        # push to all possible archives
        res_uris_idx = str(uuid.uuid4())
        res_uris[res_uris_idx] = []
        ### if arc_id == 'all':
            ### for handler in handlers:
                ### if (handlers[handler].api_required):
                    # pass args like key API
                    ### res.append(handlers[handler].push(str(URI), p_args))
                ### else:
                    ### res.append(handlers[handler].push(str(URI)))
        ### else:
            # push to the chosen archives
        threads = []
        for handler in handlers:
            if (arc_id == handler) or (arc_id == 'all'):
            ### if (arc_id == handler): ### and (handlers[handler].api_required):
                #res.append(handlers[handler].push(str(URI), p_args))
                #push_proxy( handlers[handler], str(URI), p_args, res_uris_idx)
                threads.append(Thread(target=push_proxy, args=(handlers[handler],str(URI), p_args, res_uris_idx,)))
            ### elif (arc_id == handler):
                ### res.append(handlers[handler].push(str(URI)))
        for th in threads:
            th.start()
        for th in threads:
            th.join()
        res = res_uris[res_uris_idx]
        del res_uris[res_uris_idx]
        return res
    except:
        del res_uris[res_uris_idx]
        pass
    return ["bad request"]

For example, we could have:

def push(URI, arc_id, p_args={}, headers={}):

where the user can override any of the request headers by passing them as a dictionary in the headers parameter. This dictionary would then have to be forwarded by the code on line 154 to the function that each thread executes.

I haven't submitted a pull request yet because all handlers would need to be updated to receive and act on this parameter. I'm not sure of the implications of that.
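To make the scope of that change concrete, here is a rough sketch of how a headers dictionary might travel from push through the threads to each handler. It is not ArchiveNow's code: `EchoHandler` is a hypothetical stand-in that performs no network I/O, the `push_proxy` signature is assumed, and `None` defaults replace the mutable `{}` defaults.

```python
import uuid
from threading import Thread


class EchoHandler:
    """Hypothetical stand-in for an archive handler; it makes no requests."""

    def push(self, uri, p_args=None, headers=None):
        # A real handler would pass `headers` on to its HTTP request.
        return (uri, dict(headers or {}))


handlers = {"demo": EchoHandler()}  # assumed registry: archive id -> handler
res_uris = {}                       # per-call result lists, keyed by a UUID


def push_proxy(handler, uri, p_args, headers, res_uris_idx):
    # Runs in a worker thread and records the handler's result.
    res_uris[res_uris_idx].append(handler.push(uri, p_args, headers))


def push(URI, arc_id, p_args=None, headers=None):
    p_args = {} if p_args is None else p_args
    headers = {} if headers is None else headers
    res_uris_idx = str(uuid.uuid4())
    res_uris[res_uris_idx] = []
    threads = [
        Thread(target=push_proxy,
               args=(handler, str(URI), p_args, headers, res_uris_idx))
        for name, handler in handlers.items()
        if arc_id in (name, "all")
    ]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    # pop() returns the results and removes the temporary entry in one step.
    return res_uris.pop(res_uris_idx)
```

In this sketch every handler's push method accepts a headers argument, which is the part that would require touching all existing handlers.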

@shawnmjones shawnmjones changed the title from "Issue pushing mementos into Internet Archive" to "Problems with pushing mementos into Internet Archive" Apr 1, 2020

@maturban maturban commented Apr 2, 2020

Thanks for providing details about the problem.

Do you have any suggestion for how the user can provide headers? For example:

archivenow http://www.example.com --header='{"User-Agent": "Mozilla/5.0 (Windows NT 6.1)", "Accept-Charset": "utf-8"}'
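One way such a flag could be parsed, sketched with argparse and json from the standard library. The `--header` option is only the proposal above and does not exist in ArchiveNow; the names `parse_headers` and `build_parser` are hypothetical.

```python
import argparse
import json


def parse_headers(value):
    """Parse a JSON object given on the command line into a headers dict."""
    try:
        headers = json.loads(value)
    except json.JSONDecodeError as exc:
        raise argparse.ArgumentTypeError(
            "--header must be valid JSON: {}".format(exc))
    if not isinstance(headers, dict):
        raise argparse.ArgumentTypeError("--header must be a JSON object")
    # HTTP header names and values are strings, so normalize both.
    return {str(k): str(v) for k, v in headers.items()}


def build_parser():
    parser = argparse.ArgumentParser(prog="archivenow-sketch")
    parser.add_argument("URI")
    parser.add_argument("--header", type=parse_headers, default={},
                        help="request headers as a JSON object")
    return parser
```

Validating the value at parse time means a malformed header string is rejected with a usage error before any archive is contacted.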


@maturban maturban commented Apr 2, 2020

For now, the user agent is hard-coded in the Internet Archive handler (i.e., archivenow/archivenow/handlers/ia_handler.py).


@machawk1 machawk1 commented Apr 2, 2020

@maturban MemGator lets users specify the user agent on the command line. I think a plain string passed with a semantic CLI flag (e.g., MemGator's --agent/-a) would make specifying this value more straightforward for users.

@ibnesayeed might have an opinion on this as well.
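A sketch of the single-string alternative, for comparison with a JSON-valued flag. Because ArchiveNow's --agent is already taken, the flag name `--user-agent` below is a hypothetical choice, as is the helper `headers_from_args`.

```python
import argparse


def headers_from_args(argv):
    """Parse argv and return (URI, headers) with an optional User-Agent."""
    parser = argparse.ArgumentParser(prog="archivenow-sketch")
    parser.add_argument("URI")
    parser.add_argument("--user-agent", dest="user_agent", default=None,
                        help="value to send in the User-Agent request header")
    args = parser.parse_args(argv)
    headers = {}
    if args.user_agent:
        headers["User-Agent"] = args.user_agent
    return args.URI, headers
```

A plain string avoids shell-quoting a JSON object, at the cost of covering only the one header.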
