
Ingest: Provide more robust ingest for Excel and CSV #585

Open
eaquigley opened this issue Jul 9, 2014 · 39 comments

@eaquigley
Contributor

commented Jul 9, 2014


Author Name: Kevin Condon (@kcondon)
Original Redmine Issue: 4014, https://redmine.hmdc.harvard.edu/issues/4014
Original Date: 2014-05-21
Original Assignee: Eleni Castro


The initial Excel ingest implementation is a proof of concept: it supports rectangular columns and rows of string and numeric types.

More functionality may be required (for example dates, missing-value support, and variable labels), but it is unclear what researchers who use Excel to store their data typically expect.

This ticket is a placeholder to investigate a concrete set of requirements for a more robust Excel ingest.

As a side note, there are a number of "How to use Excel for Research Data" guides posted by various universities that may answer some questions and provide a starting point.

@eaquigley eaquigley added this to the Dataverse 4.0: In Review milestone Jul 9, 2014

@scolapasta scolapasta modified the milestones: Beta 3 - Dataverse 4.0, In Review - Dataverse 4.0 Jul 15, 2014

@eaquigley

Contributor Author

commented Aug 27, 2014

@posixeleni Any update on this?

@posixeleni

Contributor

commented Sep 2, 2014

@eaquigley Will need to review the documentation mentioned above to see if it can help.

@landreev have you had any luck with this?

@posixeleni

Contributor

commented Sep 12, 2014

Will need to move this to another milestone.

@posixeleni posixeleni modified the milestones: Dataverse 4.0: Final, Beta 8 - Dataverse 4.0, Beta 10 - Dataverse 4.0 Oct 15, 2014

@posixeleni posixeleni assigned scolapasta and unassigned posixeleni Oct 15, 2014

@posixeleni

Contributor

commented Oct 15, 2014

@kcondon @landreev Found some best practices on how we can ingest Excel files (https://oit.utk.edu/research/documentation/Documents/HowToUseExcelForDataEntry.pdf), but I'm not sure whether all researchers would apply these across the board, or how exactly we would handle them in Dataverse (see the sketch after this list):

  • All your data should be in a single spreadsheet of a single file. [Eleni: what would we do if someone sent us an Excel workbook with multiple worksheets? Only load the first one and document that this is what we do?]
  • Enter variable names in the first row of the spreadsheet.
  • Variable names cannot contain spaces, but may use the underscore character. [Eleni: What would we do if variable (header) names did contain spaces? Replace spaces with underscores?]
  • No blank rows should appear in the data. [Eleni: Would we programmatically remove all blank rows?]
  • For missing values, leave the cell blank. Although SPSS and SAS use a period to represent a missing value, if you actually type a period in Excel, stat packages will read the column as character data, so you won't, for example, be able to calculate even the mean of a column. [Eleni: can we make sure we record these as N/A values?]
  • You can enter dates with slashes (6/13/2003) and times with colons (12:15 AM). [Eleni: Is this something our system can detect and store as dates and times?]
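
None of this is decided, but to make the bracketed questions above a bit more concrete, here is a rough, hypothetical sketch of what such normalization could look like with Apache POI (the ExcelNormalizer class below is made up for illustration and is not the actual ingest plugin): read the first worksheet only, replace spaces in header names with underscores, drop blank rows, and record blank cells as "N/A".

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public class ExcelNormalizer {

    private static final DataFormatter FORMATTER = new DataFormatter();

    /**
     * Reads the first worksheet only, normalizes header names (spaces to
     * underscores), drops blank rows, and records blank cells as "N/A".
     */
    public static List<List<String>> normalize(InputStream xlsx) throws Exception {
        try (Workbook workbook = WorkbookFactory.create(xlsx)) {
            Sheet sheet = workbook.getSheetAt(0);   // only the first worksheet
            List<List<String>> table = new ArrayList<>();

            int headerWidth = -1;
            for (Row row : sheet) {
                if (headerWidth < 0 && row.getLastCellNum() <= 0) {
                    continue;                       // skip empty rows before the header
                }

                List<String> values = new ArrayList<>();
                boolean allBlank = true;

                int width = (headerWidth < 0) ? row.getLastCellNum() : headerWidth;
                for (int c = 0; c < width; c++) {
                    Cell cell = row.getCell(c);
                    String text = (cell == null) ? "" : FORMATTER.formatCellValue(cell).trim();
                    if (!text.isEmpty()) {
                        allBlank = false;
                    }
                    values.add(text);
                }

                if (headerWidth < 0) {
                    // header row: replace spaces in variable names with underscores
                    values.replaceAll(name -> name.replaceAll("\\s+", "_"));
                    headerWidth = values.size();
                } else if (allBlank) {
                    continue;                       // drop blank data rows
                } else {
                    // blank cells become an explicit missing-value marker
                    values.replaceAll(v -> v.isEmpty() ? "N/A" : v);
                }
                table.add(values);
            }
            return table;
        }
    }
}
```

The DataFormatter call also renders date and time cells using whatever format was set in Excel, which might be a starting point for the dates/times question, though real type detection would take more work.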

@eaquigley Assigning to @scolapasta to see if this is something we can schedule for 4.0 or if it needs to be moved to 4.1.

@scolapasta scolapasta modified the milestones: In Review - Dataverse 4.0, Beta 10 - Dataverse 4.0 Dec 3, 2014

@scolapasta

Contributor

commented Jan 6, 2015

Probably best to just move this to 4.1, but first checking with @landreev if there is any low-hanging fruit here.

@scolapasta scolapasta assigned landreev and unassigned scolapasta Jan 6, 2015

@scolapasta scolapasta modified the milestones: Beta 11 - Dataverse 4.0, In Review - Dataverse 4.0 Jan 6, 2015

@scolapasta scolapasta modified the milestones: Beta 11 - Dataverse 4.0, Dataverse 4.0: Final Jan 23, 2015

@jggautier

Contributor

commented Feb 7, 2017

Just ran into a guide from Cornell about preparing tabular data for archiving: https://data.research.cornell.edu/content/tabular-data. (Looked but couldn't find anything similar at Harvard.) Could be helpful if we decided to improve the docs.

@mercecrosas

Member

commented Feb 8, 2017

@jggautier

Contributor

commented Apr 4, 2017

In a Slack conversation with @thegaryking about better support for CSV ingest, @landreev wrote:

the current implementation does not allow new lines embedded in text fields. Because we expect every line to contain the same number of comma-separated fields. In that particular file [that failed ingest] there are opening and trailing quotes, before and after the chunk of text with new lines... So yes, it can still be parsed unambiguously.

Should be a straightforward fix. We may have just assumed that new lines would not be meaningful to anybody in tabular data. So stripping them and/or replacing them with spaces would make an ingestible CSV file.

@bencomp

Contributor

commented Apr 5, 2017

Have you considered delegating CSV support to external libraries like Apache Commons CSV? I see it is listed as a dependency for Dataverse, but the ingest apparently still depends on @landreev's own code.
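
For what it's worth, Commons CSV already handles the embedded-newline case quoted a few comments above (quoted fields spanning multiple lines, per RFC 4180). Here is a minimal, self-contained sketch, not wired into the ingest code and with a made-up class name, just to show the parsing behavior:

```java
import java.io.StringReader;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class QuotedNewlineDemo {
    public static void main(String[] args) throws Exception {
        // A quoted field containing a comma and an embedded newline --
        // the kind of value that currently fails ingest.
        String csv = "id,comment\n"
                   + "1,\"first line,\nsecond line\"\n"
                   + "2,plain value\n";

        try (CSVParser parser = CSVParser.parse(new StringReader(csv),
                CSVFormat.RFC4180.withFirstRecordAsHeader())) {
            for (CSVRecord record : parser) {
                // The multi-line quoted field comes back as a single value.
                System.out.println(record.get("id") + " -> " + record.get("comment"));
            }
        }
    }
}
```

Each multi-line quoted field is returned as one record value, so nothing would need to be stripped or replaced.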

@pdurbin

Member

commented Apr 5, 2017

@bencomp I'm not sure why we invented our own way of parsing CSV files within the Dataverse code base. I agree with you that if we can leverage an open source package that already does this job, it would probably be better. Maybe there's some Dataverse-specific stuff in our CSV parsing. I don't know. @landreev would.

Yesterday @raprasad was showing us how Python's Pandas can parse Excel files: https://github.com/IQSS/miniverse/blob/729a8d1cb0bededbcc9882e51ccf5b874b1a178e/dv_apps/datafiles/tabular_previewer.py#L107

There's also a suggestion to have ingest run on a separate service rather than as part of the Dataverse monolith: #2331.

@landreev

Contributor

commented Apr 11, 2017

There was some historical reason why we insisted on the one-line-per-observation requirement, but it doesn't appear to be relevant anymore.
Also, when that original implementation was put together, we did not anticipate that anyone would use CSV ingest for real, full text, with rich punctuation, extra tabs, commas, and newlines. Since this appears to be a requirement now, yes, switching to an existing third-party parser that takes care of all these things seems like the right thing to do.

@landreev

Contributor

commented Apr 11, 2017

I'm opening a new issue for some concrete incremental improvements of the CSV ingest, to be estimated and included into some sprint in the short term. (This one is just too generic)

@landreev

Contributor

commented Apr 12, 2017

#3767 - a child issue I created just for rewriting the CSV plugin to use more standard (and permissive) parsing.

@oscardssmith

Contributor

commented Jul 20, 2017

Does #3767 close this?

@adam3smith


commented Mar 7, 2018

I have two additions on this -- there are a couple of tickets on Excel ingest; this seemed the most appropriate one, but please redirect me if I missed something.

  1. .tab vs. .tsv: I like that Dataverse normalizes .xlsx files, and I think tab-separated values are a good, robust choice, but why did you choose the .tab file extension (which is ambiguous and very rarely used) over the much more common .tsv?
  2. The download dialog for ingested tabular files has an option to download the "original" file. I think that should be the original xlsx file, as uploaded. Currently it's just the .tab file again.
@pdurbin

Member

commented Mar 7, 2018

@adam3smith thanks, these are related:

  • Switch preservation format extension from .tab to .tsv #2720
  • Default Download Type should be "original" #4000
@adam3smith


commented Mar 7, 2018

Great. #2720 addresses what I want with point 1, but while I agree with Alex in #4000, my point about Excel is different: the original format is completely lost when uploading Excel. I have no way to get the Excel file I uploaded back out after ingest (at least I'm not finding it). That may actually be a bug?

@jggautier

Contributor

commented Mar 7, 2018

Hi @adam3smith. The second point sounds like a bug. Could you link to a dataset with an ingested Excel file where we're not able to download the original format?

This quick search returns a few ingested Excel files in Harvard Dataverse (4.8.4), but I'm able to download the original format for them: https://dataverse.harvard.edu/dataverse/harvard?q=fileName%3Aexcel&fq0=fileTypeGroupFacet%3A%22tabulardata%22.

I just uploaded an .xlsx file on Harvard Dataverse that ingested and was able to download the original file format.

[Screenshot: screen shot 2018-03-07 at 12 14 42 pm]

@adam3smith


commented Mar 7, 2018

Sorry for the false alarm -- this works correctly for me not just on demo.dataverse but also in our dev instance on 4.8. It does seem broken the way I describe, though, in our beta environment running 4.6.2. So it may have gotten fixed at some point?

@mheppler

Contributor

commented Mar 7, 2018

@adam3smith there were a lot of ingest improvements included in our 4.8 release, which could account for improved functionality between demo (4.8.4) and your 4.6.2.

@djbrooke djbrooke added this to Inbox 🗄 in IQSS/dataverse May 8, 2019

@landreev

Contributor

commented May 31, 2019

Since this issue was opened, the CSV side has been addressed: the CSV ingest plugin was much improved.
I propose we close this issue but open a new, separate issue for the Excel plugin, and try to address it asap (there are a number of open Excel issues for various things the current plugin does wrong).

@pdurbin

Member

commented May 31, 2019

@landreev yeah, I agree that this issue should probably be closed. It's one of our oldest open issues and doesn't fit into our modern "small chunks" style. #3382 ("Excel ingest can't accept more than 26 (Z) columns") is an example of a small chunk, and @qqmyers made pull request #5891 for it yesterday. (Thank you!! 🎉 ) Then @landreev opened #5896 about adding tests for Excel. This is perfect. Small chunks move across our (new) board at https://github.com/orgs/IQSS/projects/2 much faster than large projects.

@adam3smith you seem to be fairly engaged in this issue. Are you willing or able to create a "small chunk" issue or two for some of your ideas above?

Everyone else is very welcome to create new issues as well, of course! The idea is that the issues should be relatively small and targeted. Incremental improvement. Small chunks.

@kcondon

Contributor

commented May 31, 2019

@pdurbin @landreev I opened this issue originally because the initial implementation was very basic and we did not have a good idea of which set of features we should support, given the wide variety of practices. It was, at least initially, a research spike to figure that out and then become something specific. We lucked out that there was a CSV implementation, so I agree that part is resolved. I think Leonid's suggestion of opening a new Excel-specific issue sounds reasonable, and I'm happy it has specifics, since when I opened this there was not an agreed-upon set of specifics. So: 1. open a new ticket with specifics, then 2. close this ticket ;)

@adam3smith


commented May 31, 2019

@pdurbin - I just had two issues here: one was resolved, the other is fully captured by #2720, so I'm all good with closing this.
