Spreadsheet binary object extraction #303

Closed
ruebot opened this issue Jan 31, 2019 · 2 comments

Comments

@ruebot
Member

commented Jan 31, 2019

Using the image extraction process as a basis, our next set of binary object extractions will be documents. This issue is meant to focus specifically on xls, xlsx, ods, and csv.

There may be some tweaks to this depending on the outcome of #298.

@ruebot ruebot added this to In progress in Binary object extraction Jan 31, 2019

@ruebot ruebot moved this from In progress to To do in Binary object extraction Jan 31, 2019

@jrwiebe

Contributor

commented Feb 13, 2019

Putting this here for my reference; feedback is welcome.

These are the spreadsheet MIME types Tika will identify. The Mozilla MIME type list was also consulted. Unless there are objections I think I'll extract all of these, including templates and MS Works spreadsheets.

Excel

application/vnd.ms-excel
application/vnd.ms-excel.workspace.3
application/vnd.ms-excel.workspace.4
application/vnd.ms-excel.sheet.2
application/vnd.ms-excel.sheet.3
application/vnd.ms-excel.sheet.4
application/vnd.ms-excel.addin.macroenabled.12
application/vnd.ms-excel.sheet.binary.macroenabled.12
application/vnd.ms-excel.sheet.macroenabled.12
application/vnd.ms-excel.template.macroenabled.12
application/vnd.ms-spreadsheetml

Office Open XML and OpenDocument

application/vnd.openxmlformats-officedocument.spreadsheetml.template
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
application/x-vnd.oasis.opendocument.spreadsheet-template
application/vnd.oasis.opendocument.spreadsheet-template
application/vnd.oasis.opendocument.spreadsheet
application/x-vnd.oasis.opendocument.spreadsheet

Other

application/x-tika-msworks-spreadsheet
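
For illustration only, the list above could be treated as a lookup set, with extraction applied to any record whose Tika-detected MIME type is a member. This is a minimal Python sketch, not the toolkit's code; the names `SPREADSHEET_MIME_TYPES` and `is_spreadsheet_mime` are hypothetical, and stripping parameters such as `; charset=...` before the comparison is an assumption.

```python
# Spreadsheet MIME types copied from the list in this comment.
SPREADSHEET_MIME_TYPES = frozenset({
    # Excel
    "application/vnd.ms-excel",
    "application/vnd.ms-excel.workspace.3",
    "application/vnd.ms-excel.workspace.4",
    "application/vnd.ms-excel.sheet.2",
    "application/vnd.ms-excel.sheet.3",
    "application/vnd.ms-excel.sheet.4",
    "application/vnd.ms-excel.addin.macroenabled.12",
    "application/vnd.ms-excel.sheet.binary.macroenabled.12",
    "application/vnd.ms-excel.sheet.macroenabled.12",
    "application/vnd.ms-excel.template.macroenabled.12",
    "application/vnd.ms-spreadsheetml",
    # Office Open XML and OpenDocument
    "application/vnd.openxmlformats-officedocument.spreadsheetml.template",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/x-vnd.oasis.opendocument.spreadsheet-template",
    "application/vnd.oasis.opendocument.spreadsheet-template",
    "application/vnd.oasis.opendocument.spreadsheet",
    "application/x-vnd.oasis.opendocument.spreadsheet",
    # Other
    "application/x-tika-msworks-spreadsheet",
})


def is_spreadsheet_mime(mime_type: str) -> bool:
    """Return True if mime_type (hypothetically, Tika's result) is in the list."""
    # Assumption: drop parameters like "; charset=utf-8" and ignore case.
    base = mime_type.split(";", 1)[0].strip().lower()
    return base in SPREADSHEET_MIME_TYPES
```

A filter over records would then keep only those for which `is_spreadsheet_mime` returns True.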

CSV

Currently Tika only detects CSV if the parser is given a filename with the extension "CSV", although byte-based detection might be coming (TIKA-2826). I'll detect CSV by looking at the URL extension and checking if getMimeType() == "text/csv".
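
A rough sketch of that CSV check, in Python for illustration. The function name `looks_like_csv` and its parameters are hypothetical; `server_mime_type` stands in for the value of `getMimeType()`. It assumes either signal alone (a `.csv` URL extension or a `text/csv` MIME type) is enough, which the comment above does not settle.

```python
from urllib.parse import urlparse


def looks_like_csv(url: str, server_mime_type: str) -> bool:
    """Guess whether a record is CSV from its URL and reported MIME type."""
    # Look only at the URL path so query strings do not hide the extension.
    path = urlparse(url).path
    has_csv_extension = path.lower().endswith(".csv")

    # Assumption: ignore parameters such as "; charset=utf-8" and letter case.
    base_mime = server_mime_type.split(";", 1)[0].strip().lower()

    # Assumption: either signal is sufficient on its own.
    return has_csv_extension or base_mime == "text/csv"
```

Requiring both signals instead would mean replacing `or` with `and`; which combination works better depends on how reliable the reported MIME types are in a given collection.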

@ruebot ruebot moved this from To do to In progress in Binary object extraction Aug 12, 2019

@ruebot ruebot self-assigned this Aug 14, 2019

ruebot added a commit that referenced this issue Aug 15, 2019

Add office document binary extraction.
- Add WordProcessor DF and binary extraction
- Add Spreadsheets DF and binary extraction
- Add Presentation Program DF and binary extraction
- Add tests for new DF and binary extractions
- Add test fixture for new DF and binary extractions
- Resolves #303
- Resolves #304
- Resolves #305
- Back out 39831c2 (We _might_ not have to do this)

Binary object extraction automation moved this from In progress to Done Aug 16, 2019

ianmilligan1 added a commit that referenced this issue Aug 16, 2019

Add office document binary extraction. (#346)
- Add Word Processor DF and binary extraction
- Add Spreadsheets DF and binary extraction
- Add Presentation Program DF and binary extraction
- Add Text files DF and binary extraction
- Add tests for new DF and binary extractions
- Add test fixtures for new DF and binary extractions
- Resolves #303
- Resolves #304
- Resolves #305
- Use aut-resources repo to distribute our shaded tika-parsers 1.22
- Close TikaInputStream
- Add RDD filters on MimeTypeTika values
- Add CodeCov configuration yaml
- Includes work by @jrwiebe, see #346 for all commits before squash