
Ingest: Provide more robust ingest for Excel and CSV #585

Open
eaquigley opened this issue Jul 9, 2014 · 39 comments

@eaquigley
Contributor

commented Jul 9, 2014


Author Name: Kevin Condon (@kcondon)
Original Redmine Issue: 4014, https://redmine.hmdc.harvard.edu/issues/4014
Original Date: 2014-05-21
Original Assignee: Eleni Castro


The initial Excel ingest implementation is a proof of concept: it supports rectangular columns and rows of string and numeric types.

More functionality may be required (for example dates, missing-value support, and variable labels), but it is unclear what researchers who use Excel to store their data typically expect.

This ticket is a placeholder to investigate a concrete set of requirements for a more robust Excel ingest.

As a side note, there are a number of "How to use Excel for Research Data" guides posted by various universities that may answer some questions and provide a starting point.

@eaquigley eaquigley added this to the Dataverse 4.0: In Review milestone Jul 9, 2014

@scolapasta scolapasta modified the milestones: Beta 3 - Dataverse 4.0, In Review - Dataverse 4.0 Jul 15, 2014

@eaquigley

Contributor Author

commented Aug 27, 2014

@posixeleni Any update on this?

@posixeleni

Contributor

commented Sep 2, 2014

@eaquigley Will need to review the documentation mentioned above to see if it can help.

@landreev have you had any luck with this?

@posixeleni

Contributor

commented Sep 12, 2014

Will need to move this to another milestone.

@posixeleni posixeleni modified the milestones: Dataverse 4.0: Final, Beta 8 - Dataverse 4.0, Beta 10 - Dataverse 4.0 Oct 15, 2014

@posixeleni posixeleni assigned scolapasta and unassigned posixeleni Oct 15, 2014

@posixeleni

Contributor

commented Oct 15, 2014

@kcondon @landreev Found some best practices on how we can ingest Excel files (https://oit.utk.edu/research/documentation/Documents/HowToUseExcelForDataEntry.pdf), but I'm not sure whether all researchers would apply these across the board, or how exactly we would handle them in Dataverse (see the sketch after this list):

  • All your data should be in a single spreadsheet of a single file. [Eleni: what would we do if someone sent us an Excel workbook with multiple worksheets? Only load the first one and document that this is what we do?]
  • Enter variable names in the first row of the spreadsheet.
  • Variable names cannot contain spaces, but may use the underscore character. [Eleni: What would we do if variable (header) names did contain spaces? Replace spaces with underscores?]
  • No blank rows should appear in the data. [Eleni: Would we programmatically remove all blank rows?]
  • For missing values, leave the cell blank. Although SPSS and SAS use a period to represent a missing value, if you actually type a period in Excel, stat packages will read the column as character data, so you won't, for example, be able to calculate even the mean of a column. [Eleni: can we make sure we record these as N/A values?]
  • You can enter dates with slashes (6/13/2003) and times with colons (12:15 AM). [Eleni: Is this something our system can detect and store as dates and times?]
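
None of this is decided, but to make the bracketed questions above a bit more concrete, here is a rough, hypothetical sketch of what such normalization could look like with Apache POI (the ExcelNormalizer class below is made up for illustration and is not the actual ingest plugin): read the first worksheet only, replace spaces in header names with underscores, drop blank rows, and record blank cells as "N/A".

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public class ExcelNormalizer {

    private static final DataFormatter FORMATTER = new DataFormatter();

    /**
     * Reads the first worksheet only, normalizes header names (spaces to
     * underscores), drops blank rows, and records blank cells as "N/A".
     */
    public static List<List<String>> normalize(InputStream xlsx) throws Exception {
        try (Workbook workbook = WorkbookFactory.create(xlsx)) {
            Sheet sheet = workbook.getSheetAt(0);   // only the first worksheet
            List<List<String>> table = new ArrayList<>();

            int headerWidth = -1;
            for (Row row : sheet) {
                if (headerWidth < 0 && row.getLastCellNum() <= 0) {
                    continue;                       // skip empty rows before the header
                }

                List<String> values = new ArrayList<>();
                boolean allBlank = true;

                int width = (headerWidth < 0) ? row.getLastCellNum() : headerWidth;
                for (int c = 0; c < width; c++) {
                    Cell cell = row.getCell(c);
                    String text = (cell == null) ? "" : FORMATTER.formatCellValue(cell).trim();
                    if (!text.isEmpty()) {
                        allBlank = false;
                    }
                    values.add(text);
                }

                if (headerWidth < 0) {
                    // header row: replace spaces in variable names with underscores
                    values.replaceAll(name -> name.replaceAll("\\s+", "_"));
                    headerWidth = values.size();
                } else if (allBlank) {
                    continue;                       // drop blank data rows
                } else {
                    // blank cells become an explicit missing-value marker
                    values.replaceAll(v -> v.isEmpty() ? "N/A" : v);
                }
                table.add(values);
            }
            return table;
        }
    }
}
```

The DataFormatter call also renders date and time cells using whatever format was set in Excel, which might be a starting point for the dates/times question, though real type detection would take more work.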

@eaquigley Assigning to @scolapasta to see if this is something we can schedule for 4.0 or if it needs to be moved to 4.1.

@scolapasta scolapasta modified the milestones: In Review - Dataverse 4.0, Beta 10 - Dataverse 4.0 Dec 3, 2014

@scolapasta

Contributor

commented Jan 6, 2015

Probably best to just move this to 4.1, but first checking with @landreev if there is any low-hanging fruit here.

@scolapasta scolapasta assigned landreev and unassigned scolapasta Jan 6, 2015

@scolapasta scolapasta modified the milestones: Beta 11 - Dataverse 4.0, In Review - Dataverse 4.0 Jan 6, 2015

@scolapasta scolapasta modified the milestones: Beta 11 - Dataverse 4.0, Dataverse 4.0: Final Jan 23, 2015

@jggautier

Contributor

commented Feb 7, 2017

Just ran into a guide from Cornell about preparing tabular data for archiving: https://data.research.cornell.edu/content/tabular-data. (Looked but couldn't find anything similar at Harvard.) Could be helpful if we decided to improve the docs.

@mercecrosas

Member

commented Feb 8, 2017

@jggautier

Contributor

commented Apr 4, 2017

In a Slack conversation with @thegaryking about better support for CSV ingest, @landreev wrote:

the current implementation does not allow new lines embedded in text fields. Because we expect every line to contain the same number of comma-separated fields. In that particular file [that failed ingest] there are opening and trailing quotes, before and after the chunk of text with new lines... So yes, it can still be parsed unambiguously.

Should be a straightforward fix. We may have just assumed that new lines would not be meaningful to anybody in tabular data. So stripping them and/or replacing them with spaces would make an ingestible CSV file.

@bencomp

Contributor

commented Apr 5, 2017

Have you considered delegating CSV support to external libraries like Apache Commons CSV? I see it is listed as a dependency for Dataverse, but the ingest apparently still depends on @landreev's own code.
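
For what it's worth, Commons CSV already handles the embedded-newline case quoted a few comments above (quoted fields spanning multiple lines, per RFC 4180). Here is a minimal, self-contained sketch, not wired into the ingest code and with a made-up class name, just to show the parsing behavior:

```java
import java.io.StringReader;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class QuotedNewlineDemo {
    public static void main(String[] args) throws Exception {
        // A quoted field containing a comma and an embedded newline --
        // the kind of value that currently fails ingest.
        String csv = "id,comment\n"
                   + "1,\"first line,\nsecond line\"\n"
                   + "2,plain value\n";

        try (CSVParser parser = CSVParser.parse(new StringReader(csv),
                CSVFormat.RFC4180.withFirstRecordAsHeader())) {
            for (CSVRecord record : parser) {
                // The multi-line quoted field comes back as a single value.
                System.out.println(record.get("id") + " -> " + record.get("comment"));
            }
        }
    }
}
```

Each multi-line quoted field is returned as one record value, so nothing would need to be stripped or replaced.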

@pdurbin

Member

commented Apr 5, 2017

@bencomp I'm not sure why we invented our own way of parsing CSV files within the Dataverse code base. I agree with you that if we can leverage an open source package that already does this job, it would probably be better. Maybe there's some Dataverse-specific stuff in our CSV parsing. I don't know. @landreev would.

Yesterday @raprasad was showing us how Python's Pandas can parse Excel files: https://github.com/IQSS/miniverse/blob/729a8d1cb0bededbcc9882e51ccf5b874b1a178e/dv_apps/datafiles/tabular_previewer.py#L107

There's also a suggestion to have ingest run on a separate service rather than as part of the Dataverse monolith: #2331.

@landreev

Contributor

commented Apr 11, 2017

There was some historical reason why we insisted on the one-line-per-observation requirement, but it doesn't appear to be relevant anymore.
Also, when that original implementation was put together, we did not anticipate that anyone would use CSV ingest for real, full text, with rich punctuation, extra tabs, commas, and newlines. Since this appears to be a requirement now, yes, switching to an existing third-party parser that takes care of all these things seems like the right thing to do.

@landreev

Contributor

commented Apr 11, 2017

I'm opening a new issue for some concrete incremental improvements of the CSV ingest, to be estimated and included into some sprint in the short term. (This one is just too generic)

@landreev

Contributor

commented Apr 12, 2017

#3767 - a child issue I created just for rewriting the CSV plugin to use more standard (and permissive) parsing.

@oscardssmith

Contributor

commented Jul 20, 2017

Does #3767 close this?

@adam3smith


commented Mar 7, 2018

I have two additions on this -- there are a couple of tickets on Excel ingest; this seemed the most appropriate one, but please redirect me if I missed something.

  1. .tab vs. .tsv: I like that Dataverse normalizes .xlsx files, and I think tab-separated values are a good, robust choice, but why did you choose the .tab file extension (which is ambiguous and very rarely used) over the much more common .tsv?
  2. The download dialog for ingested tabular files has an option to download the "original" file. I think that should be the original xlsx file, as uploaded. Currently it's just the .tab file again.
@pdurbin

Member

commented Mar 7, 2018

@adam3smith thanks, these are related:

  • Switch preservation format extension from .tab to .tsv #2720
  • Default Download Type should be "original" #4000
@adam3smith


commented Mar 7, 2018

Great. #2720 addresses what I want with point 1, but while I agree with Alex in #4000, my point about Excel is different: the original format is completely lost when uploading Excel. I have no way to get the Excel file I uploaded back out after ingest (at least I'm not finding it). That may actually be a bug?

@jggautier

Contributor

commented Mar 7, 2018

Hi @adam3smith. The second point sounds like a bug. Could you link to a dataset with an ingested Excel file where we're not able to download the original format?

This quick search returns a few ingested Excel files in Harvard Dataverse (4.8.4), but I'm able to download the original format for them: https://dataverse.harvard.edu/dataverse/harvard?q=fileName%3Aexcel&fq0=fileTypeGroupFacet%3A%22tabulardata%22.

I just uploaded an .xlsx file on Harvard Dataverse that ingested and was able to download the original file format.

[Screenshot: screen shot 2018-03-07 at 12 14 42 pm]

@adam3smith


commented Mar 7, 2018

Sorry for the false alarm -- this works correctly for me not just on demo.dataverse but also in our dev instance on 4.8. It does seem broken the way I describe, though, in our beta environment running 4.6.2. So it may have gotten fixed at some point?

@mheppler

Contributor

commented Mar 7, 2018

@adam3smith there were a lot of ingest improvements included in our 4.8 release, which could account for improved functionality between demo (4.8.4) and your 4.6.2.

@djbrooke djbrooke added this to Inbox 🗄 in IQSS/dataverse May 8, 2019

@landreev

Contributor

commented May 31, 2019

Since this issue was opened, the CSV side has been addressed: the CSV ingest plugin was much improved.
I propose we close this issue but open a new, separate issue for the Excel plugin, and try to address it asap (there are a number of open Excel issues for various things the current plugin does wrong).

@pdurbin

Member

commented May 31, 2019

@landreev yeah, I agree that this issue should probably be closed. It's one of our oldest open issues and doesn't fit into our modern "small chunks" style. #3382 ("Excel ingest can't accept more than 26 (Z) columns") is an example of a small chunk, and @qqmyers made pull request #5891 for it yesterday. (Thank you!! 🎉 ) Then @landreev opened #5896 about adding tests for Excel. This is perfect. Small chunks move across our (new) board at https://github.com/orgs/IQSS/projects/2 much faster than large projects.

@adam3smith you seem to be fairly engaged in this issue. Are you willing or able to create a "small chunk" issue or two for some of your ideas above?

Everyone else is very welcome to create new issues as well, of course! The idea is that the issues should be relatively small and targeted. Incremental improvement. Small chunks.

@kcondon

Contributor

commented May 31, 2019

@pdurbin @landreev I opened this issue originally because the initial implementation was very basic and we did not have a good idea of which set of features we should support, given the wide variety of practices. It was, at least initially, a research spike to figure that out and then become something specific. We lucked out that there was a CSV implementation, so I agree that part is resolved. I think Leonid's suggestion of opening a new Excel-specific issue sounds reasonable, and I'm happy it has specifics, since when I opened this there was not an agreed-upon set of specifics. So: 1. open a new ticket with specifics, then 2. close this ticket ;)

@adam3smith


commented May 31, 2019

@pdurbin - I just had two issues here: one was resolved, the other is fully captured by #2720, so I'm all good with closing this.
