Spreadsheet binary object extraction #303

ruebot · Jan 31, 2019

Using the image extraction process as a basis, our next set of binary object extractions will be documents. This issue is meant to focus specifically on xls, xlsx, ods, and csv.

There may be some tweaks to this depending on the outcome of #298.

jrwiebe · Feb 13, 2019

Putting this here for my reference; feedback is welcome.

These are the spreadsheet MIME types Tika will identify. The Mozilla MIME type list was also consulted. Unless there are objections I think I'll extract all of these, including templates and MS Works spreadsheets.

Excel

application/vnd.ms-excel
application/vnd.ms-excel.workspace.3
application/vnd.ms-excel.workspace.4
application/vnd.ms-excel.sheet.2
application/vnd.ms-excel.sheet.3
application/vnd.ms-excel.sheet.4
application/vnd.ms-excel.addin.macroenabled.12
application/vnd.ms-excel.sheet.binary.macroenabled.12
application/vnd.ms-excel.sheet.macroenabled.12
application/vnd.ms-excel.template.macroenabled.12
application/vnd.ms-spreadsheetml

Open Office

application/vnd.openxmlformats-officedocument.spreadsheetml.template
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
application/x-vnd.oasis.opendocument.spreadsheet-template
application/vnd.oasis.opendocument.spreadsheet-template
application/vnd.oasis.opendocument.spreadsheet
application/x-vnd.oasis.opendocument.spreadsheet

Other

application/x-tika-msworks-spreadsheet
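
As a rough illustration of how the list above could be used as a filter, here is a minimal Scala sketch. The object name `SpreadsheetMimeTypes` and the helper `isSpreadsheet` are hypothetical names chosen for this example, not part of aut or Tika; the only thing taken from this thread is the MIME type strings themselves.

```scala
import java.util.Locale

// Hypothetical helper object; the names here are illustrative assumptions.
object SpreadsheetMimeTypes {
  // MIME types copied verbatim from the Excel, Open Office, and Other lists above.
  val all: Set[String] = Set(
    "application/vnd.ms-excel",
    "application/vnd.ms-excel.workspace.3",
    "application/vnd.ms-excel.workspace.4",
    "application/vnd.ms-excel.sheet.2",
    "application/vnd.ms-excel.sheet.3",
    "application/vnd.ms-excel.sheet.4",
    "application/vnd.ms-excel.addin.macroenabled.12",
    "application/vnd.ms-excel.sheet.binary.macroenabled.12",
    "application/vnd.ms-excel.sheet.macroenabled.12",
    "application/vnd.ms-excel.template.macroenabled.12",
    "application/vnd.ms-spreadsheetml",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.template",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/x-vnd.oasis.opendocument.spreadsheet-template",
    "application/vnd.oasis.opendocument.spreadsheet-template",
    "application/vnd.oasis.opendocument.spreadsheet",
    "application/x-vnd.oasis.opendocument.spreadsheet",
    "application/x-tika-msworks-spreadsheet"
  )

  // True when a detected MIME type string is one of the types listed above.
  // Input is trimmed and lower-cased so minor formatting differences do not matter.
  def isSpreadsheet(mimeType: String): Boolean =
    all.contains(mimeType.trim.toLowerCase(Locale.ROOT))
}
```

Keeping the types in a single `Set` means a later decision to drop templates or MS Works files would be a one-line change.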

CSV

Currently Tika only detects CSV if the parser is given a filename with the extension "CSV", although byte-based detection might be coming (TIKA-2826). I'll detect CSV by looking at the URL extension and checking if getMimeType() == "text/csv".
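
A minimal sketch of the CSV check described above, under stated assumptions: the object `CsvDetection` and its methods are hypothetical names, the URL and MIME type are passed in as plain strings rather than read from a record, and the sketch treats either signal (a `.csv` URL extension or a `text/csv` MIME type) as sufficient. Whether both should be required is a design choice this example does not settle.

```scala
import java.net.URI
import java.util.Locale
import scala.util.Try

// Hypothetical helper object; names and the "either signal" rule are assumptions.
object CsvDetection {
  // True when the path component of the URL ends in ".csv" (case-insensitive).
  // Query strings and fragments are ignored; malformed URLs yield false.
  def hasCsvExtension(url: String): Boolean =
    Try(Option(new URI(url).getPath)).toOption.flatten
      .exists(_.toLowerCase(Locale.ROOT).endsWith(".csv"))

  // True when the MIME type is text/csv, ignoring case and any
  // trailing parameters such as "; charset=utf-8".
  def isCsvMimeType(mimeType: String): Boolean =
    mimeType.takeWhile(_ != ';').trim.toLowerCase(Locale.ROOT) == "text/csv"

  // Assumption: either signal alone is enough to treat the record as CSV.
  def looksLikeCsv(url: String, mimeType: String): Boolean =
    hasCsvExtension(url) || isCsvMimeType(mimeType)
}
```

Parsing the URL with `java.net.URI` rather than calling `endsWith` on the raw string keeps a URL such as `http://example.com/report.csv?download=1` from being missed because of its query string.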

jrwiebe · Aug 2, 2019

MIME type references for #303, #304, #305, #306, #307:

That should do it.

ruebot added enhancement Scala feature DataFrames labels Jan 31, 2019

ruebot added this to In progress in Binary object extraction Jan 31, 2019

ruebot moved this from In progress to To do in Binary object extraction Jan 31, 2019

ruebot moved this from To do to In progress in Binary object extraction Aug 12, 2019

ruebot self-assigned this Aug 14, 2019

ruebot referenced this issue Aug 15, 2019
Merged
Add office document binary extraction. #346

ianmilligan1 closed this in #346 Aug 16, 2019

Binary object extraction automation moved this from In progress to Done Aug 16, 2019

archivesunleashed/aut

ruebot added a commit that referenced this issue Aug 15, 2019

ianmilligan1 added a commit that referenced this issue Aug 16, 2019
