Spreadsheet binary object extraction #303
ruebot added the enhancement, Scala, feature, and DataFrames labels on Jan 31, 2019
ruebot added this to In progress in Binary object extraction on Jan 31, 2019
ruebot moved this from In progress to To do in Binary object extraction on Jan 31, 2019
Putting this here for my reference; feedback is welcome. These are the spreadsheet MIME types Tika will identify. The Mozilla MIME type list was also consulted. Unless there are objections, I think I'll extract all of these, including templates and MS Works spreadsheets.
Excel: application/vnd.ms-excel
Open Office: application/vnd.openxmlformats-officedocument.spreadsheetml.template
Other: application/x-tika-msworks-spreadsheet
CSV: Currently Tika only detects CSV if the parser is given a filename with the extension "CSV", although byte-based detection might be coming (TIKA-2826). I'll detect CSV by looking at the URL extension and checking if
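The filtering idea in this comment can be sketched roughly as follows. This is an illustrative Python sketch only, not the toolkit's Scala implementation: the names `SPREADSHEET_MIME_TYPES`, `has_csv_extension`, and `is_spreadsheet` are hypothetical, the MIME type set holds only the three types quoted above rather than the full list, and the second CSV condition is cut off in the comment, so only the URL extension check is modeled.

```python
from urllib.parse import urlparse

# Assumption: only the MIME types quoted in the comment are listed here;
# the full list Tika identifies is longer.
SPREADSHEET_MIME_TYPES = frozenset({
    "application/vnd.ms-excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.template",
    "application/x-tika-msworks-spreadsheet",
})


def has_csv_extension(url: str) -> bool:
    """Return True if the URL path ends in .csv (case-insensitive)."""
    # urlparse separates the path from any query string or fragment.
    path = urlparse(url).path
    return path.lower().endswith(".csv")


def is_spreadsheet(url: str, detected_mime_type: str) -> bool:
    """Hypothetical filter: known spreadsheet MIME type, or CSV by URL extension."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    mime = detected_mime_type.split(";", 1)[0].strip().lower()
    return mime in SPREADSHEET_MIME_TYPES or has_csv_extension(url)
```

For example, `is_spreadsheet("http://example.com/data.CSV?x=1", "text/plain")` matches on the extension alone, which is why the comment proposes a further check for CSV before extracting.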
MIME type references for #303, #304, #305, #306, #307:
That should do it.
ruebot commented Jan 31, 2019
Using the image extraction process as a basis, our next set of binary object extractions will be documents. This issue is meant to focus specifically on xls, xlsx, ods, and csv. There may be some tweaks to this depending on the outcome of #298.