adds locale option initial-family-name-delimiter #299
Why "initial"? |
Because it deals with initials? (Or what do you mean?) |
No, I'm just wondering: why is this only specific to initials? Is it feasible some styles would require this on non-initialized names? |
It's possible, yes. But I think it would be rather uncommon. On the other hand, putting a non-breaking space between an abbreviation and the following word (if they form a unit) is a rather common typographic rule. |
Hmm, even Chicago recommends non-breaking spaces between initials. (6.121) |
That is handled by |
No idea why the closing keywords aren't working. |
I don't really see styles specifying that names in general cannot break across lines, but styles/locales do have varying rules regarding line breaks after initials or abbreviations, so I think limiting the behavior to initialized names is good. |
Should the value really be "text"? Or should we itemize, with just the single item for now? |
You mean
? |
Yes, with space default.
|
What do you think @bwiernik ? Should we make this change? |
Just for the record:
|
The problem I see with using text is twofold:
|
It should just be text. They should enter an HTML entity. This is what is done in hundreds of styles in the repository. It might be good to add a comment somewhere that non-ASCII characters should be entered as HTML entities. |
LGTM |
Why?
Ugh.
Hang on. We should not be using HTML entities for this in my view. For reference, a simple python function that will convert entities to Unicode. |
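A minimal sketch of what such a conversion could look like, using only the Python standard library. This is not the function linked in the comment above; the name `entities_to_unicode` is hypothetical and chosen for illustration.

```python
import html


def entities_to_unicode(text: str) -> str:
    """Replace HTML named and numeric character references with the characters they denote."""
    # html.unescape handles named references (&nbsp;, &mdash;) as well as
    # numeric ones (&#160;, &#xA0;).
    return html.unescape(text)


if __name__ == "__main__":
    # repr() makes the otherwise invisible non-breaking space visible.
    print(repr(entities_to_unicode("J.&nbsp;Doe &mdash; 2020")))
```

Note that `html.unescape` also converts `&amp;` and `&lt;`, so running it over a raw XML file would break the markup; a real migration script would presumably need to leave those references alone.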
Why not? CSL styles/locales are XML. It's natural for them to be HTML entities. Beyond that:
|
As to why it should just be text, why make a more complex system with an enumerated list that (1) processors would have to build a handler for and that (2) works differently from every other "delimiter" attribute in CSL, rather than just letting style and locale writers enter the text that should appear. This really is an already solved problem. |
Not really. HTML entities are a funky legacy of a time before XML and Unicode.
You raise some good points here, but in the end, because CSL is an output-format-agnostic format, HTML entities aren't helpful for formats that don't support them. If we go this way, we should specify Unicode, and can use a script to automate the translation on the styles repo. |
I absolutely oppose anything that isn't just typing text. Delimiters in CSL are entered as simple text--we shouldn't make that inconsistent. Every processor needs to be able to parse XML already, so they can decode entities if the output format requires it. There is also the problem that in some cases, using Unicode characters directly is less clear. For example, is a space a regular space or a non-breaking space? I think switching from HTML entities to Unicode would be more of a problem than the existing status quo that is already implemented by styles, locales, and every CSL processor. I spent months a few years ago trying to get locale-agnostic rendering of Unicode characters to work nicely with R. Unicode is very poorly supported on Windows even to this day. I fear that entering Unicode characters is going to lead to lots of confusion from infrequent editors who try to edit styles in programs like Notepad that don't have good encoding recognition behavior. Before we arbitrarily switch conventions, I'd want to hear from @adam3smith and @rmzelle since they are the ones who would likely have to deal with this. |
Let's also tag @PaulStanley here (sorry Paul) again, since I believe he's working on a new processor that targets an output format that does not support HTML entities. Paul: the question is are you OK with HTML entities in CSL styles, or would you prefer Unicode? Why? |
I'll just mention here, in retrospect, this was probably a mistake, per the earlier discussion about whitespace handling in XML attributes. |
This is what
❯ cat t.xml
<test text="&nbsp;"/>
❯ xmllint t.xml
t.xml:1: parser error : Entity 'nbsp' not defined
<test text="&nbsp;"/>
... and jing:
/tmp/t.xml:1:19: fatal: The entity "nbsp" was referenced, but not declared.
I.e., it's not valid XML unless it's explicitly declared in the same file, which adds an additional layer of complexity all around. OTOH, you have this:
❯ cat t.xml
<test text=" "/>
❯ xmllint t.xml
<?xml version="1.0"?>
<test text=" "/> So |
Just reading quickly: I don't see why we'd support HTML escaping (i.e. On the repository, we escape a number of common Unicode characters to escaped XML because, as Brenton says, they're easier to distinguish. That's definitely em-dash and non-breaking space, but there are likely others (the script running formatter.citationstyles.org/ does this automatically, as does Rintze's python reindent/reorder script (afaik they're identical)). |
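The kind of escaping described here might look roughly like the sketch below. It is not the actual formatter or reindent script: the function name `escape_ambiguous`, the table `ESCAPES`, and the choice of decimal character references are all assumptions made for illustration. Only the two characters named in the comment (non-breaking space and em-dash) are included.

```python
# Hypothetical table: characters that are hard to tell apart by eye,
# mapped to XML numeric character references (decimal form assumed).
ESCAPES = {
    "\u00a0": "&#160;",   # non-breaking space
    "\u2014": "&#8212;",  # em-dash
}


def escape_ambiguous(text: str, table: dict = ESCAPES) -> str:
    """Replace each character listed in `table` with its character reference."""
    return "".join(table.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(escape_ambiguous("J.\u00a0Doe"))
```

Numeric references like these need no declaration, so the escaped file stays parseable by any XML tool.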
So if there are examples of |
I'll merge this then with |
There isn't. Validation fails with a &nbsp; ("undeclared entity"). Given that I think this is generally true for XML validation and not just an artifact of our validator, I'm not sure we need to specifically exclude HTML entities any more than we need to exclude unescaped ampersands -- this has never been a source of confusion. (I also don't have any strong objections to including it if there's any indication of confusion I'm not aware of.) |
I misspoke when I was remembering the entity name before. |
It is indeed a general rule; XML documents with undeclared entities are not valid.
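The same behavior can be reproduced with Python's standard-library XML parser, as a small check of the rule rather than anything specific to the CSL validator. The helper name `parse_text_attribute` is hypothetical; the `<test text="..."/>` element mirrors the one used earlier in the thread.

```python
import xml.etree.ElementTree as ET


def parse_text_attribute(xml_source: str):
    """Return the value of the 'text' attribute, or None if the parser rejects the input."""
    try:
        return ET.fromstring(xml_source).get("text")
    except ET.ParseError:
        # Raised, among other cases, for entities that are referenced but not declared.
        return None


if __name__ == "__main__":
    # Named HTML entity: not declared anywhere, so parsing fails.
    print(parse_text_attribute('<test text="&nbsp;"/>'))
    # Numeric character reference: always allowed in XML.
    print(repr(parse_text_attribute('<test text="&#160;"/>')))
```

The named entity is rejected because nothing declares it, while the numeric reference resolves to U+00A0 without any declaration.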
|
denismaier commented Jul 2, 2020
Description
per #236 (comment):
This adds a new locale option initial-family-name-delimiter.
Closes #236