adds locale option initial-family-name-delimiter #299

Merged

Conversation

@denismaier
Member

denismaier commented Jul 2, 2020

Description

per #236 (comment):

Specifically, French typography typically requires a non breaking space here, whereas US typography very much doesn't.
I.e. A.&#160;Smith (non-breaking space) vs. A. Smith (regular space)
The former is currently impossible in Zotero.
This should likely be a locale spec

This adds a new locale option initial-family-name-delimiter.

Closes #236

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Checklist

  • I have installed the repo pre-commit hook; if not, and I have modified any of the schema files, I have run trang and/or prettier on the files, per CONTRIBUTING
  • I have performed a self-review of my own code
  • [] I have included suggested corresponding changes to the documentation (if relevant)
@denismaier denismaier requested review from bdarcus, adam3smith and bwiernik Jul 2, 2020
Member

bdarcus left a comment

Why "initial"?

@denismaier
Member Author

denismaier commented Jul 2, 2020

Why "initial"?

Because it deals with initials? (Or what do you mean?)
Or, do you have a better suggestion?

@bdarcus
Member

bdarcus commented Jul 2, 2020

No, I'm just wondering: why is this only specific to initials? Is it feasible some styles would require this on non-initialized names?

@denismaier
Member Author

denismaier commented Jul 2, 2020

It's possible, yes. But I think it would be rather uncommon. On the other hand, putting a non-breaking space between an abbreviation and the following word (if they form a unit) is a rather common typographic rule.

@denismaier
Member Author

denismaier commented Jul 2, 2020

Hmm, even Chicago recommends non-breaking spaces between initials. (6.121)
This leads to the question: How should we deal with delimiters between multiple initials?

@bwiernik
Contributor

bwiernik commented Jul 2, 2020

That is handled by initialize-with

@denismaier
Member Author

denismaier commented Jul 7, 2020

Why is this not linked to #236? @bwiernik @bdarcus ideas? Maybe because master isn't the base branch?

@bwiernik
Contributor

bwiernik commented Jul 7, 2020

No idea why the closing keywords aren't working.

@bwiernik bwiernik linked an issue that may be closed by this pull request Jul 7, 2020
@bwiernik
Contributor

bwiernik commented Jul 7, 2020

I don't really see styles specifying that names in general cannot break across lines, but styles/locales do have varying rules regarding line breaks after initials or abbreviations, so I think limiting the behavior to initialized names is good.

@bdarcus
Member

bdarcus commented Jul 16, 2020

Should the value really be "text"? Or should we itemize, with just the single item for now?

@denismaier
Member Author

denismaier commented Jul 16, 2020

You mean

attribute initial-family-name-delimiter { "non-breaking-space" | "space" }?,

?

@bdarcus
Member

bdarcus commented Jul 16, 2020

@denismaier
Member Author

denismaier commented Jul 16, 2020

What do you think @bwiernik ? Should we make this change?

@denismaier
Member Author

denismaier commented Jul 16, 2020

Just for the record: initialize-with is currently also defined as text

## Activate initializing of given names. The attribute value is appended
## to each initial (e.g., with ". ", "Orson Welles" becomes "O. Welles").
attribute initialize-with { text }?,
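
To make the interplay between initialize-with and the proposed option concrete, here is a minimal Python sketch. It is not taken from any CSL processor: the function name `render_name`, its parameters, and the rule that trailing whitespace from initialize-with is dropped before the delimiter is inserted are all assumptions made for illustration.

```python
def render_name(given, family, initialize_with=None,
                initial_family_name_delimiter=" "):
    """Hypothetical helper: join given and family name the way the
    discussion describes, under the assumptions stated above."""
    if initialize_with is None:
        # Names that are not initialized are unaffected by the new option.
        return f"{given} {family}"
    # Append the initialize-with value to each initial ("Orson" -> "O. ").
    initials = "".join(part[0] + initialize_with for part in given.split())
    # Assumption: trailing whitespace is dropped, then the delimiter is used.
    return initials.rstrip() + initial_family_name_delimiter + family


# US-style regular space versus a French-style non-breaking space (U+00A0).
print(repr(render_name("Orson", "Welles", ". ")))
print(repr(render_name("Orson", "Welles", ". ", "\u00a0")))
```

Printing with `repr` makes the otherwise invisible U+00A0 show up as an escape sequence.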
@bdarcus
Member

bdarcus commented Jul 18, 2020

Just for the record: initialize-with is currently also defined as text

## Activate initializing of given names. The attribute value is appended
## to each initial (e.g., with ". ", "Orson Welles" becomes "O. Welles").
attribute initialize-with { text }?,

The problem I see with using text is twofold:

  1. UX - how is a style author to know how to enter a non-breaking space?
  2. What is the content supposed to actually be? a unicode code point? If yes, see above. And what happens if someone puts in a non-unicode value?
@bwiernik
Contributor

bwiernik commented Jul 18, 2020

It should just be text.

They should enter an HTML entity. This is what is done in hundreds of styles in the repository. It might be good to add a comment somewhere that non-ASCII characters should be entered as HTML entities.

@bwiernik
Contributor

bwiernik commented Jul 18, 2020

LGTM

@bdarcus
Member

bdarcus commented Jul 18, 2020

It should just be text.

Why?

They should enter an HTML entity. This is what is done in hundreds of styles in the repository. It might be good to add a comment somewhere that non-ASCII characters should be entered as HTML entities.

Ugh.

LGTM

Hang on.

We should not be using HTML entities for this in my view.

For reference, a simple python function that will convert entities to Unicode.
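
The function referred to above is not preserved in this page, so the following is only a stand-in sketch of the idea, not the code that was linked. The name `entities_to_unicode` and the decision to leave the five XML-predefined entities untouched (so the result stays well-formed XML) are assumptions.

```python
import re
from html.entities import name2codepoint

# Entities that XML itself defines; replacing these would break the markup.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def entities_to_unicode(xml_text):
    """Replace HTML named entities (e.g. &nbsp;) with the Unicode
    character they stand for, leaving XML-predefined entities alone."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # keep unknown or XML-level entities
        return chr(name2codepoint[name])

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, xml_text)


print(repr(entities_to_unicode('<test text="&nbsp;&amp;"/>')))
```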

@bwiernik
Contributor

bwiernik commented Jul 18, 2020

Why not? CSL styles/locales are XML. It's natural for them to be HTML entities.

Beyond that:

  1. It's already done that way in hundreds of styles.
  2. As you note, it's often easier for style writers to type HTML entities than extended unicode characters.
  3. HTML entities don't have potential locale/encoding corruption issues that just typing Unicode characters might have.
@bwiernik
Contributor

bwiernik commented Jul 18, 2020

As to why it should just be text, why make a more complex system with an enumerated list that (1) processors would have to build a handler for and that (2) works differently from every other "delimiter" attribute in CSL, rather than just letting style and locale writers enter the text that should appear.

This really is an already solved problem.

@bdarcus
Member

bdarcus commented Jul 18, 2020

Why not? CSL styles/locales are XML. It's natural for them to be HTML entities.

Not really. HTML entities are a funky legacy of a time before XML and unicode.

Beyond that:

  1. It's already done that way in hundreds of styles.
  2. As you note, it's often easier for style writers to type HTML entities than extended unicode characters.
  3. HTML entities don't have potential locale/encoding corruption issues that just typing Unicode characters might have.
  1. we can correct this, so this is a non-issue
  2. both suck though from a style authoring perspective
  3. but the below doesn't either, and doesn't suck (though does have other limitations, but not clear to me that matters here)
initial-family-name-delimiter='non-breaking-space'

As to why it should just be text, why make a more complex system with an enumerated list that (1) processors would have to build a handler for and that (2) works differently from every other "delimiter" attribute in CSL, rather than just letting style and locale writers enter the text that should appear.

You raise some good points here, but in the end, CSL is an output-format-agnostic format, and HTML entities aren't helpful for formats that don't support them.

If we go this way, we should specify unicode, and can use a script to automate the translation on the styles repo.

@bwiernik
Contributor

bwiernik commented Jul 18, 2020

I absolutely oppose anything that isn't just typing text. Delimiters in CSL are entered as simple text--we shouldn't make that inconsistent.

Every processor needs to be able to parse XML already, so they can decode entities if the output format requires it.

There is also the problem that in some cases, using Unicode characters directly is less clear. For example, is a space a regular space or a non-breaking space? &nbsp; is unambiguous when editing.

I think switching from HTML entities to Unicode would be more of a problem than the existing status quo that is already implemented by styles, locales, and every CSL processor.

I spent months a few years ago trying to get locale-agnostic rendering of Unicode characters to work nicely with R. Unicode is very poorly supported on Windows even to this day. I fear that entering Unicode characters is going to lead to lots of confusion from infrequent editors who try to edit styles in programs like Notepad that don't have good encoding recognition behavior.

Before we arbitrarily switch conventions, I'd want to hear from @adam3smith and @rmzelle since they are the ones who would likely have to deal with this.

@bdarcus
Member

bdarcus commented Jul 18, 2020

Let's also tag @PaulStanley here (sorry Paul) again, since I believe he's working on a new processor that targets an output format that does not support HTML entities.

Paul: the question is are you OK with HTML entities in CSL styles, or would you prefer Unicode? Why?

@bdarcus
Member

bdarcus commented Jul 18, 2020

I absolutely oppose anything that isn't just typing text. Delimiters in CSL are entered as simple text--we shouldn't make that inconsistent.

I'll just mention here, in retrospect, this was probably a mistake, per the earlier discussion about whitespace handling in XML attributes.

@bdarcus
Member

bdarcus commented Jul 18, 2020

This is what xmllint, arguably the most widely used xml parser, does with this entity.

❯ cat t.xml 
<test text="&nbsp;"/>
                                                                                                                                 
❯ xmllint t.xml
t.xml:1: parser error : Entity 'nbsp' not defined
<test text="&nbsp;"/>

... and jing:

/tmp/t.xml:1:19: fatal: The entity "nbsp" was referenced, but not declared.

I.e., it's not valid XML unless the entity is explicitly declared in the same file, which adds an additional layer of complexity all around.

OTOH, you have this:

❯ cat t.xml 
<test text="&#160;"/>

❯ xmllint t.xml
<?xml version="1.0"?>
<test text="&#xA0;"/>

So &#160; is valid, and will internally convert to &#xA0;.
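
The same contrast can be reproduced without xmllint or jing. The sketch below uses Python's standard-library `xml.etree.ElementTree`; the helper name `read_text_attribute` is made up for this example, and returning `None` on a parse failure is just a convenience for the demonstration.

```python
import xml.etree.ElementTree as ET


def read_text_attribute(xml_source):
    """Return the value of the 'text' attribute, or None if the
    document is rejected by the XML parser."""
    try:
        return ET.fromstring(xml_source).get("text")
    except ET.ParseError:
        # An undeclared entity such as &nbsp; ends up here.
        return None


# HTML named entity: not declared in plain XML, so parsing fails.
print(repr(read_text_attribute('<test text="&nbsp;"/>')))

# Numeric character reference: always legal, decoded to U+00A0.
print(repr(read_text_attribute('<test text="&#160;"/>')))
```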

@adam3smith
Member

adam3smith commented Jul 18, 2020

Just reading quickly:
I dislike the lack of flexibility introduced by something like:
initial-family-name-delimiter='non-breaking-space'
What, for example, if someone wants a narrow non-breaking space here (we currently discourage that on the repository due to poor support in fonts, but that might change and it can already be used for custom styles). So I'd lean towards text.

I don't see why we'd support HTML escaping (i.e. &nbsp;) over XML escaping &#160;, which we currently use, which is legal XML and which works with all the linters (try e.g. with Bruce's example above). I'm guessing @bwiernik is actually referring to the latter because introducing random HTML into XML does indeed sound like a very bad idea.

On the repository, we convert a number of common Unicode characters to escaped XML because, as Brenton says, they're easier to distinguish. That's definitely em-dash and non-breaking space, but there are likely others (the script running formatter.citationstyles.org/ does this automatically, as does Rintze's Python reindent/reorder script; afaik they're identical).
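
As a rough illustration of that kind of escaping step (not the actual formatter or reindent script, whose character list is not given here), a sketch might look like this. The mapping `AMBIGUOUS_CHARS` and the function name `escape_ambiguous` are assumptions, and only the two characters named above are included.

```python
# Hypothetical mapping: only the two characters mentioned in the comment.
AMBIGUOUS_CHARS = {
    "\u00a0": "&#160;",   # non-breaking space
    "\u2014": "&#8212;",  # em-dash
}

_TABLE = str.maketrans(AMBIGUOUS_CHARS)


def escape_ambiguous(text):
    """Replace hard-to-distinguish characters with XML numeric
    character references so they are visible when editing."""
    return text.translate(_TABLE)


print(escape_ambiguous("A.\u00a0Smith \u2014 2020"))
```

Because numeric character references are plain XML, the escaped output still passes the validators shown earlier in the thread.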

@bdarcus
Member

bdarcus commented Jul 18, 2020

I don't see why we'd support HTML escaping (i.e. &nbsp;) over XML escaping &#160;

So if there are examples of &nbsp; in the styles repo, that's an oversight we should fix? I'm not sure there are, but that's what I was gathering from this discussion (which, as you note, could have been a misunderstanding).

@bdarcus
Member

bdarcus commented Jul 18, 2020

I'll merge this then with text, but urge we explicitly say people should use unicode everywhere in CSL files in the spec, and NOT use HTML entities.

@bdarcus bdarcus merged commit 607729d into citation-style-language:v1.1 Jul 18, 2020
2 checks passed: CSL JSON Tests, CSL XML Tests
@adam3smith
Member

adam3smith commented Jul 18, 2020

So if there are examples of &nbsp; in the styles repo, that's an oversight we should fix?

There aren't. Validation fails with a &nbsp; ("undeclared entity"). Given that I think this is generally true for XML validation and not just an artifact of our validator, I'm not sure we need to specifically exclude HTML entities any more than we need to exclude unescaped ampersands -- this has never been a source of confusion. (I also don't have any strong objections to including it if there's any indication of confusion I'm not aware of.)

@bwiernik
Contributor

bwiernik commented Jul 18, 2020

I misspoke when I was remembering the entity name before.

@bdarcus
Member

bdarcus commented Jul 18, 2020

4 participants