adds locale option initial-family-name-delimiter #299

denismaier · 2020-07-02T08:50:42Z

Description

Specifically, French typography typically requires a non-breaking space here, whereas US typography very much doesn't.
I.e., A.&#160;Smith (non-breaking space) vs. A. Smith (regular space).
The former is currently impossible in Zotero.
This should likely be a locale spec

This adds a new locale option initial-family-name-delimiter.

Closes #236

Type of change

Please delete options that are not relevant.

New feature (non-breaking change which adds functionality)
This change requires a documentation update

Checklist

I have installed the repo pre-commit hook; if not, and I have modified any of the schema files, I have run trang and/or prettier on the files, per CONTRIBUTING
I have performed a self-review of my own code
[] I have included suggested corresponding changes to the documentation (if relevant)



bdarcus

Why "initial"?

denismaier · 2020-07-02T10:27:47Z

Why "initial"?

Because it deals with initials? (Or what do you mean?)
Or, do you have a better suggestion?

bdarcus · 2020-07-02T10:31:42Z

No, I'm just wondering: why is this only specific to initials? Is it feasible some styles would require this on non-initialized names?

denismaier · 2020-07-02T10:41:08Z

It's possible, yes. But I think it would be rather uncommon. Having a non-breaking space after an abbreviation and the following word (if they form a unit) is a rather common typographic rule on the other hand.

denismaier · 2020-07-02T10:45:30Z

Hmm, even Chicago recommends non-breaking spaces between initials. (6.121)
This leads to the question: How should we deal with delimiters between multiple initials?

bwiernik · 2020-07-02T12:15:27Z

That is handled by initialize-with
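
To make the division of labour concrete, here is a minimal Python sketch of how a processor might combine the two attributes. It is not taken from any existing CSL processor: the function name and parameters are hypothetical, and it assumes that trailing whitespace from initialize-with is trimmed before the new delimiter is appended.

```python
NBSP = "\u00a0"  # non-breaking space

def render_initialized_name(given, family,
                            initialize_with=". ",
                            initial_family_name_delimiter=" "):
    """Hypothetical helper: initialize the given names, then join to the family name.

    Assumption: whitespace left over from initialize_with after the last
    initial is dropped, so the delimiter alone separates initials and family name.
    """
    # initialize_with is appended to each initial, so it also controls
    # what appears between multiple initials.
    initials = "".join(part[0] + initialize_with for part in given.split())
    return initials.rstrip() + initial_family_name_delimiter + family

us_style = render_initialized_name("Orson", "Welles")
french_style = render_initialized_name("Orson", "Welles",
                                       initial_family_name_delimiter=NBSP)
```

With the defaults this gives "O. Welles" with a regular space; passing the non-breaking space as the delimiter keeps the initial and the family name on the same line.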

denismaier · 2020-07-07T08:58:19Z

Why is this not linked to #236? @bwiernik @bdarcus ideas? Maybe because its base branch isn't master?

bwiernik · 2020-07-07T16:15:47Z

No idea why the closing keywords aren't working.

bwiernik · 2020-07-07T16:20:12Z

I don't really see styles specifying that names in general cannot break across lines, but styles/locales do have varying rules regarding line breaks after initials or abbreviations, so I think limiting the behavior to initialized names is good.

bdarcus · 2020-07-16T21:19:17Z

Should the value really be "text"? Or should we itemize, with just the single item for now?

denismaier · 2020-07-16T21:23:06Z

You mean

attribute initial-family-name-delimiter { "non-breaking-space" | "space" }?,

?

bdarcus · 2020-07-16T21:25:41Z

Yes, with space default.

denismaier · 2020-07-16T21:27:41Z

What do you think @bwiernik ? Should we make this change?

denismaier · 2020-07-16T21:29:39Z

Just for the record: initialize-with is currently also defined as text

## Activate initializing of given names. The attribute value is appended
## to each initial (e.g., with ". ", "Orson Welles" becomes "O. Welles").
attribute initialize-with { text }?,

bdarcus · 2020-07-18T12:08:45Z

Just for the record: initialize-with is currently also defined as text

## Activate initializing of given names. The attribute value is appended
## to each initial (e.g., with ". ", "Orson Welles" becomes "O. Welles").
attribute initialize-with { text }?,

The problem I see with using text is twofold:

UX: how is a style author to know how to enter a non-breaking space?
What is the content actually supposed to be? A Unicode code point? If yes, see above. And what happens if someone puts in a non-Unicode value?

bwiernik · 2020-07-18T12:21:08Z

It should just be text.

They should enter an HTML entity. This is what is done in hundreds of styles in the repository. It might be good to add a comment somewhere that non-ASCII characters should be entered as HTML entities.

bwiernik · 2020-07-18T12:21:40Z

LGTM

bdarcus · 2020-07-18T12:27:38Z

It should just be text.

Why?

They should enter an HTML entity. This is what is done in hundreds of styles in the repository. It might be good to add a comment somewhere that non-ASCII characters should be entered as HTML entities.

Ugh.

LGTM

Hang on.

We should not be using HTML entities for this in my view.

For reference, a simple python function that will convert entities to Unicode.
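
The function referred to above is not reproduced in this thread. As a rough illustration of the idea only, a conversion like this can be sketched with the Python standard library's `html.unescape`; the wrapper name `entities_to_unicode` is an assumption made for this example.

```python
import html

def entities_to_unicode(text):
    """Replace HTML named and numeric character references with Unicode characters."""
    # html.unescape handles named entities such as &nbsp; as well as
    # numeric references such as &#160; and &#xA0;.
    return html.unescape(text)

converted = entities_to_unicode("A.&nbsp;Smith")
```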

bwiernik · 2020-07-18T12:38:48Z

Why not? CSL styles/locales are XML. It's natural for them to be HTML entities.

Beyond that:

It's already done that way in hundreds of styles.
As you note, it's often easier for style writers to type HTML entities than extended unicode characters.
HTML entities don't have potential locale/encoding corruption issues that just typing Unicode characters might have.

bwiernik · 2020-07-18T12:43:34Z

As to why it should just be text, why make a more complex system with an enumerated list that (1) processors would have to build a handler for and that (2) works differently from every other "delimiter" attribute in CSL, rather than just letting style and locale writers enter the text that should appear.

This really is an already solved problem.

bwiernik · 2020-07-18T12:46:02Z

See for example: https://github.com/citation-style-language/locales/blob/master/locales-fr-FR.xml

bdarcus · 2020-07-18T12:54:52Z

Why not? CSL styles/locales are XML. It's natural for them to be HTML entities.

Not really. HTML entities are a funky legacy of a time before XML and unicode.

Beyond that:

It's already done that way in hundreds of styles.

As you note, it's often easier for style writers to type HTML entities than extended unicode characters.

HTML entities don't have potential locale/encoding corruption issues that just typing Unicode characters might have.

we can correct this, so this is a non-issue
both suck though from a style authoring perspective
but the below doesn't either, and doesn't suck (though does have other limitations, but not clear to me that matters here)

initial-family-name-delimiter='non-breaking-space'

As to why it should just be text, why make a more complex system with an enumerated list that (1) processors would have to build a handler for and that (2) works differently from every other "delimiter" attribute in CSL, rather than just letting style and locale writers enter the text that should appear.

You raise some good points here, but in the end CSL is an output-format-agnostic format, so HTML entities aren't helpful for output formats that don't support them.

If we go this way, we should specify unicode, and can use a script to automate the translation on the styles repo.

bwiernik · 2020-07-18T13:09:08Z

I absolutely oppose anything that isn't just typing text. Delimiters in CSL are entered as simple text--we shouldn't make that inconsistent.

Every processor needs to be able to parse XML already, so they can decode entities if the output format requires it.

There is also the problem that in some cases, using Unicode characters directly is less clear. For example, is a space a regular space or a non-breaking space? An escaped entity is unambiguous when editing.

I think switching from HTML entities to Unicode would be more of a problem than the existing status quo that is already implemented by styles, locales, and every CSL processor.

I spent months a few years ago trying to get locale-agnostic rendering of Unicode characters to work nicely with R. Unicode is very poorly supported on Windows even to this day. I fear that entering Unicode characters is going to lead to lots of confusion from infrequent editors who try to edit styles in programs like Notepad that don't have good encoding recognition behavior.

Before we arbitrarily switch conventions, I'd want to hear from @adam3smith and @rmzelle since they are the ones who would likely have to deal with this.

bdarcus · 2020-07-18T13:24:58Z

Let's also tag @PaulStanley here (sorry Paul) again, since I believe he's working on a new processor that targets an output format that does not support HTML entities.

Paul: the question is are you OK with HTML entities in CSL styles, or would you prefer Unicode? Why?

bdarcus · 2020-07-18T13:25:47Z

I absolutely oppose anything that isn't just typing text. Delimiters in CSL are entered as simple text--we shouldn't make that inconsistent.

I'll just mention here, in retrospect, this was probably a mistake, per the earlier discussion about whitespace handling in XML attributes.

bdarcus · 2020-07-18T13:40:28Z

This is what xmllint, arguably the most widely used xml parser, does with this entity.

❯ cat t.xml 
<test text="&nbsp;"/>
❯ xmllint t.xml
t.xml:1: parser error : Entity 'nbsp' not defined
<test text="&nbsp;"/>

... and jing:

/tmp/t.xml:1:19: fatal: The entity "nbsp" was referenced, but not declared.

I.e., it's not valid XML unless the entity is explicitly declared in the same file, which adds an additional layer of complexity all around.

OTOH, you have this:

❯ cat t.xml 
<test text="&#160;"/>

❯ xmllint t.xml
<?xml version="1.0"?>
<test text="&#xA0;"/>

So &#160; is valid, and will internally convert to &#xA0;.
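
The same contrast can be reproduced without xmllint or jing. The following is a small sketch using Python's standard `xml.etree.ElementTree` parser; the helper name `parse_text_attribute` is made up for this example.

```python
import xml.etree.ElementTree as ET

def parse_text_attribute(xml_source):
    """Return the 'text' attribute of the root element, or None if parsing fails."""
    try:
        return ET.fromstring(xml_source).get("text")
    except ET.ParseError:
        # Raised for the undeclared named entity.
        return None

named = parse_text_attribute('<test text="&nbsp;"/>')     # HTML entity, not declared
numeric = parse_text_attribute('<test text="&#160;"/>')   # numeric character reference
```

Here `named` is `None` because the parser rejects the undeclared entity, while `numeric` holds the single character U+00A0.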

adam3smith · 2020-07-18T13:48:08Z

Just reading quickly:
I dislike the lack of flexibility introduced by something like:
initial-family-name-delimiter='non-breaking-space'
What, for example, if someone wants a narrow non-breaking space here (we currently discourage that on the repository due to poor support in fonts, but that might change and it can already be used for custom styles). So I'd lean towards text.

I don't see why we'd support HTML escaping (i.e. &nbsp;) over XML escaping (&#160;), which we currently use, which is legal XML and which works with all the linters (try e.g. with Bruce's example above). I'm guessing @bwiernik is actually referring to the latter because introducing random HTML into XML does indeed sound like a very bad idea.

On the repository, we convert a number of common Unicode characters to XML escapes because, as Brenton says, they're easier to distinguish. That's definitely em-dash and non-breaking space, but there are likely others. The script running formatter.citationstyles.org/ does this automatically, as does Rintze's python reindent/reorder script (afaik they're identical).
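
As an illustration of this kind of escaping (not the actual repository script, whose list of escaped characters is not given here), Python's built-in `xmlcharrefreplace` error handler turns every non-ASCII character into a numeric character reference. The function name `escape_non_ascii` is an assumption for this sketch, and it escapes all non-ASCII characters rather than a selected few.

```python
def escape_non_ascii(text):
    """Replace every non-ASCII character with an XML numeric character reference."""
    # Encoding to ASCII with 'xmlcharrefreplace' substitutes &#NNN; for each
    # character that ASCII cannot represent; ASCII text passes through unchanged.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

escaped = escape_non_ascii("A.\u00a0Smith")
```

The result makes the non-breaking space visible in any editor as `A.&#160;Smith`.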

bdarcus · 2020-07-18T13:49:58Z

I don't see why we'd support HTML escaping (i.e. &nbsp;) over XML escaping (&#160;)

So if there are examples of &nbsp; in the styles repo, that's an oversight we should fix? I'm not sure there are, but that's what I was gathering from this discussion (which as you note, could have been a misunderstanding).

bdarcus · 2020-07-18T13:57:33Z

I'll merge this then with text, but urge we explicitly say people should use unicode everywhere in CSL files in the spec, and NOT use HTML entities.

adam3smith · 2020-07-18T17:58:35Z

So if there are examples of &nbsp; in the styles repo, that's an oversight we should fix?

There aren't. Validation fails with &nbsp; ("undeclared entity"). Given that I think this is generally true for XML validation and not just an artifact of our validator, I'm not sure we need to specifically exclude HTML entities any more than we need to exclude unescaped ampersands -- this has never been a source of confusion. (I also don't have any strong objections to including it if there's any indication of confusion I'm not aware of.)

bwiernik · 2020-07-18T18:00:54Z

I misspoke when I was remembering the entity name before.

bdarcus · 2020-07-18T18:29:35Z

It is indeed a general rule; XML documents with undeclared entities are not valid.

adds locale option initial-family-name-delimiter (bfdc734)

denismaier requested review from bdarcus, adam3smith and bwiernik Jul 2, 2020

bdarcus reviewed Jul 2, 2020


bwiernik linked an issue that may be closed by this pull request Jul 7, 2020

Make space between initial and last name customizable #236

Open

bwiernik approved these changes Jul 7, 2020


bdarcus merged commit 607729d into citation-style-language:v1.1 Jul 18, 2020
2 checks passed: CSL JSON Tests, CSL XML Tests
citation-style-language / schema
