Opened 14 years ago

Last modified 14 years ago

#63 defect closed fixed (fixed)

[PATCH] Document types of avatarIds

Reported by: itamarst
Owned by:
Priority: high
Milestone:
Component: conch
Keywords:
Cc: Glyph, radix, itamarst
Branch:
Author:

Description


Change History (24)

comment:1 Changed 14 years ago by radix

+1 for allowing unicode. What's so hard about encoding an
avatar ID into utf-8 for backends?

What reasons are there to not allow it?

comment:2 Changed 14 years ago by Glyph

+0.  UTF-8 is an 8-bit encoding, which means it's not just ASCII.  You 
can encode NULs and so forth.  Some backends expect a different 
encoding and a different translation to unicode (pre-existing 
authentication databases going against Oracle with JIS japanese 
encoding, off the top of my head).

It's doable, and for 99% of the cases out there it won't make a 
difference to client code, but there are still other considerations.  
What if the Avatar ID is some encoding of an integer, and not a string?

I'd like to avoid doing this until somebody who really knows unicode 
can tell us how.

comment:3 Changed 14 years ago by radix

How about the policy "it's up to the cred-checker [or
whichever bit is relevant] to accept unicode or not"? That's
basically how it is now, afaict?

comment:4 Changed 14 years ago by itamarst

To clarify - credential checkers generate an avatar id, which
is passed to the realm. We need to make sure realms work with
*all* credential checkers, and if most checkers generate
strings and the realm assumes this, and then the admin changes
to a checker that produces unicode, the realm will *break*.

So, at the minimum we need to require realms to support
unicode, which we do not do at the moment.

comment:5 Changed 14 years ago by radix

So if I understand correctly, if we decide to say "avatar
IDs are unicode", then only the credentials-checker has to
care about encoding or decoding the unicode (iff the
backend/storage mechanism it uses doesn't natively support
unicode). When is this a problem?

comment:6 Changed 14 years ago by itamarst

If we document that avatar ids can be both unicode and
8-bit, we should be ok, since all realms will be able to
deal with both by downgrading or upgrading, depending on how
their storage works. So if we ever decide to restrict it to
unicode only it will still work.

comment:7 Changed 14 years ago by radix

well? this is "urgent", has the fix been decided? are we
just going to document that realms should accept both
unicode and regular strings?

comment:8 Changed 14 years ago by radix

this is obviously not urgent

comment:9 Changed 14 years ago by Jean-Paul Calderone

http://www.ietf.org/internet-drafts/draft-ietf-sasl-saslprep-03.txt
has something to say about this.

comment:10 Changed 14 years ago by Glyph

Ahem.

"It is not intended to be used for to prepare identities
which are not simple user names (e.g., distinguished names
and domain names).  Nor is the profile intended to be used
for simple user names which require different handling. 
Protocols (or applications of those protocols) which have
application-specific identity forms and/or comparison
algorithms should use mechanisms specifically designed for
these forms and algorithms."

I don't understand what that spec is trying to say.

How about this - for comparison and such, we will always
call 'credStringValue.decode("utf-8")'.  This will disallow
non-ASCII characters in non-unicode strings, but will still
allow unicode strings.

comment:11 Changed 14 years ago by Glyph

Plus, we should get this in for release 1.1.

comment:12 Changed 14 years ago by Glyph

Plus I meant ".encode", not ".decode"
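
A minimal sketch of the comparison rule proposed in comment:10 (as corrected here), written in present-day Python, where `bytes` plays the role of the old 8-bit `str` and `str` plays the role of the old `unicode`. The helper names `normalize_avatar_id` and `same_avatar` are hypothetical and not part of cred; the explicit ASCII check is an assumption that stands in for the implicit ASCII step the old `str.encode` performed.

```python
def normalize_avatar_id(avatar_id):
    """Return avatar_id as a UTF-8 byte string (hypothetical helper)."""
    if isinstance(avatar_id, str):
        # Text avatar IDs are encoded as UTF-8.
        return avatar_id.encode("utf-8")
    # Byte strings carry no encoding information, so accept only ASCII;
    # this raises UnicodeDecodeError for any byte >= 0x80.
    avatar_id.decode("ascii")
    return avatar_id


def same_avatar(a, b):
    """Compare two avatar IDs after normalizing both of them."""
    return normalize_avatar_id(a) == normalize_avatar_id(b)
```

With this rule a text ID and an ASCII byte-string ID spelling the same name compare equal, while a byte string containing non-ASCII bytes is rejected instead of being guessed at.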

comment:13 Changed 14 years ago by Moshe Zadka

Is this really a bug?
I'm not sure...

comment:14 Changed 14 years ago by Glyph

It's a doc bug, at least.

comment:15 Changed 14 years ago by Glyph

Okay, it's clear nobody is totally sure how to do this correctly, so I'm
removing the release1.1 tag.  It's still a doc bug, and we need to find someone
who has real unicode-login-name use cases and be sure that the solution I've
outlined below works.  But I can't see why it wouldn't.

comment:16 Changed 14 years ago by Moshe Zadka

Should we just document that "currently, unicode IDs are not supported -- if 
you have a use case, please explain it in a bug report" and close this? I'm 
loath to add any more code, or *EVEN DOCUMENT AN APPROACH* if we have nfi
what we are talking about. I'd feel much safer supporting stuff with a use
case in mind [even if the support comes to documenting stuff].

comment:17 Changed 14 years ago by Moshe Zadka

Here's a patch to document the non-supportingness of unicode strings

Index: doc/howto/cred.xhtml
===================================================================
RCS file: /cvs/Twisted/doc/howto/cred.xhtml,v
retrieving revision 1.5
diff -u -r1.5 cred.xhtml
--- doc/howto/cred.xhtml        17 Oct 2003 04:46:19 -0000      1.5
+++ doc/howto/cred.xhtml        19 Oct 2003 12:21:43 -0000
@@ -128,6 +128,12 @@
 <p>This method will typically be called from 'Portal.login'.  The avatarId
 is the one returned by a CredentialChecker.</p>

+<div class="note">
+Avatars, currently, can only be strings. Passing unicode strings around,
+in particular, is <em>not</em> supported by the infrastructure. If you
+find a need for unicode usernames, please file a bug with your specific
+use-case.</div>
+
 <p>The important thing to realize about this method is that if it is being
 called, <em>the user has already authenticated</em>.  Therefore, if possible,
 the Realm should create a new user if one does not already exist

comment:18 Changed 14 years ago by Glyph

I don't have no idea whatsoever, I just don't know that this is a 
panacea.  Python's encoding support is very well done, so it's not like 
we're designing from scratch either.

Considering that this approach will continue to work even if we firm up 
the spec so that it's no longer really necessary, I'd still like to 
suggest it, rather than having folks who *really* have NFI what they're 
talking about come up with some cockeyed idea where they just have 
magical realms that emit some other random instance object from 
requestAvatarId rather than actually using this "workaround" for 
conforming to the interface.

comment:19 Changed 14 years ago by Moshe Zadka

I didn't understand a word you said.
Please attempt to be clearer.
My attempt is to firm up the interface *now* and *perhaps* loosen
it *later* if and when we have a use case. For example, do you think
we should support unicode avatar ids in files? [in which case a utf-8
thingy might be sane] Do you think we should support unicode avatar ids
from databases? [in which case it's better to work with opaque objects
and do no encoding/decoding at all] What happens when a unicode conversion
error happens when trying to see if an avatar id belongs to a checker?
Do we treat it as user-not-found or as
catastrophical-bug-argh-shut-down-connection?

I have no answers for any of those. In my sole experience with a
Hebrew/Arabic/English site, all the usernames [avatar IDs in cred-speak]
are in plain ASCII [and I'm *pretty sure* it's not a database problem
but a decision non-technical users made] So I'm pretty sure
unicode-in-usernames is a) not an important issue b) a world of hurt.
Hence, I would tend to discourage writing support for it until someone
comes with a clear use-case ["I need Japanese username support. My
users hate UTF-8 because it was invented by white people. I have a
colon-separated file with usernames in SHIFT-JIS and passwords in
ASCII. How do I use cred?" is a somewhat tongue-in-cheek example but not
*entirely* unrealistic, and I'd hate to tell this guy "well, we made
some decisions about unicode incompatible with your needs. Nobody really
uses unicode though, so let's try breaking unicode compatibility and
see what happens."]

comment:20 Changed 14 years ago by Glyph

> My attempt is to firm up the interface *now* and *perhaps* loosen
> it *later* if and when we have a use case.

My goal is the same.  My rationale for suggesting the encoding strategy 
is to say to potential users, "This is the interface: str=>str.  We are 
not supporting unicode, and we're doing that on purpose.  Encode it 
UTF-8 if you must, because that at least looks like ASCII some of the 
time.  If you have a better idea for how this should work, let us know, 
but in the meanwhile DON'T decide to return random junk like 
EncodedUsername("HELLO", "latin-1") from your requestAvatarId in order 
to support internationalization: return a string or your code cannot 
possibly work with other peoples' realms."

> For example, do you think
> we should support unicode avatar ids in files? [in which case a utf-8
> thingy might be sane]

Yes.

> Do you think we should support unicode avatar ids
> from databases? [in which case it's better to work with opaque objects
> and do no encoding/decoding at all]

Yes, but the *way* we should support unicode from databases with our 
current interface would be to encode to utf-8 on one side of the 
interface and decode on the other.  We don't have a clear idea of what a 
good opaque object would be.
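
As a rough illustration of "encode on one side, decode on the other", here is a sketch in present-day Python (`bytes` standing in for the old 8-bit `str`). The function names and the `users` dictionary are hypothetical stand-ins for a checker, a realm and its storage; nothing here is cred's actual API.

```python
def checker_avatar_id(username):
    """Checker side: hand the realm a UTF-8 byte string, never text."""
    return username.encode("utf-8")


def realm_lookup(avatar_id, users):
    """Realm side: decode the byte string before consulting storage."""
    return users[avatar_id.decode("utf-8")]


# Hypothetical storage keyed by text usernames.
users = {"m\u00fcller": "mailbox-17"}
avatar_id = checker_avatar_id("m\u00fcller")
mailbox = realm_lookup(avatar_id, users)
```

The value crossing the interface is always a plain byte string, so a realm written against ASCII-only checkers keeps working; only a realm that wants the text form needs to call `.decode`.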

> What happens when a unicode conversion
> error happens when trying to see if an avatar id belongs to a checker?
> Do we treat it as user-not-found or as
> catastrophical-bug-argh-shut-down-connection?

Well, that's up to the checker, to some extent.  If implemented 
properly, "user not found".  If not implemented properly, 
"UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: 
ordinal not in range(128)" or some variation on that: this will still be 
handled just fine by the checker and will not yield any particularly 
sensitive information to the client.
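
One way the "implemented properly" case might look, as a sketch in present-day Python: the conversion error is caught and reported as an ordinary unknown user. `UserNotFound` and `find_user` are hypothetical names used only for illustration, and the choice of UTF-8 for incoming names is an assumption.

```python
class UserNotFound(Exception):
    """Hypothetical error a checker raises for unknown users."""


def find_user(raw_username, known_users):
    """Return the decoded username, or raise UserNotFound."""
    try:
        username = raw_username.decode("utf-8")
    except UnicodeDecodeError:
        # An undecodable name cannot match any stored user, so treat it
        # exactly like a name that is simply not in the database.
        raise UserNotFound("no such user") from None
    if username not in known_users:
        raise UserNotFound("no such user")
    return username
```

Both failure paths raise the same error with the same message, so the client cannot tell a malformed name from a missing one.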

> I have no answers for any of those. In my sole experience with a
> Hebrew/Arabic/English site, all the usernames [avatar IDs in cred-speak]
> are in plain ASCII [and I'm *pretty sure* it's not a database problem
> but a decision non-technical users made] So I'm pretty sure
> unicode-in-usernames is a) not an important issue b) a world of hurt.
> Hence, I would tend to discourage writing support for it until someone
> comes with a clear use-case ["I need Japanese username support. My
> users hate UTF-8 because it was invented by white people. I have a
> colon-separated file with usernames in SHIFT-JIS and passwords in
> ASCII. How do I use cred?" is a somewhat tongue-in-cheek example but not
> *entirely* unrealistic, and I'd hate to tell this guy "well, we made
> some decisions about unicode incompatible with your needs. Nobody really
> uses unicode though, so let's try breaking unicode compatibility and
> see what happens."]

SHIFT-JIS encoded text by itself can easily be brought back and forth to 
unicode, no?  Doesn't the problem with "white people invented it" only 
arise when you *mix* encodings?  e.g. some BIG5 and some JIS on the same 
webpage?

In other words, just have him have his checker know that the data 
storage format is in JIS, but pass the username around encoded UTF-8 
when it goes to the realm.  Any display software he's writing along with 
this will have to store the region-encoding hint along with their avatar
so that when he adds korean support, it will know which usernames came
from korean-encoded chinese vs. japanese-encoded kanji, but the extra 
encode/decode/encode step will still go through unicode on its way 
through the avatar and not cause problems or lose information.
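
A sketch of that arrangement in present-day Python, assuming the storage really is Shift-JIS and that Python's `shift_jis` codec round-trips the characters in use (an assumption that may not hold for every character). `STORAGE_ENCODING`, `to_avatar_id` and `to_storage` are hypothetical names.

```python
# Assumption: the checker's backing file stores usernames in Shift-JIS.
STORAGE_ENCODING = "shift_jis"


def to_avatar_id(stored_username):
    """Checker side: storage bytes -> text -> UTF-8 avatar ID."""
    return stored_username.decode(STORAGE_ENCODING).encode("utf-8")


def to_storage(avatar_id):
    """Reverse direction: UTF-8 avatar ID -> text -> storage bytes."""
    return avatar_id.decode("utf-8").encode(STORAGE_ENCODING)
```

Only the checker needs to know about Shift-JIS; the realm sees a UTF-8 byte string like any other avatar ID.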

Also, if I'm wrong (having never been *directly* involved with this kind 
of asian-language insanity, I'm sure my understanding is at least 
partially flawed) let's say he has to have special knowledge in his 
realm of his checker.  It's not the end of the world.  With such an 
unusual use-case, it is unlikely he will require integration with other 
peoples' cred software, but *even if he does*, if he just has a wacky 
encoding scheme, the sysadmin can just do a little work in the realm's 
storage layer to make sure that it matches up with the peculiarities of 
his encoding, manually running some scripts to go from SHIFT-JIS to 
UTF-8 if necessary.  This is, after all, what sysadmins do :).

*BUT*, this only works if we encourage some sanity, in that we do not say
"we don't know how unicode should work at all, just do whatever you
want".  Saying that encourages anyone with a unicode-ish use case, even
someone who knows considerably *less* about the potential problems that
entails (think newbie ex-java programmer here), to come
up with a whole secondary framework for username encodings, along with
self-hashing subclasses of string or other insanity, rather than just
adhering to this simple convention, because it is "cleaner" not to have
to call .encode or .decode in their application logic.

Hopefully this is clear.  This suggestion is intended to preserve the 
existing interface in all instances where it can be preserved, and to 
give some boundaries for people who *THINK* their use-case is not supported.

Another thing that is probably going to make this discussion moot is the 
emergence of UID/GIDs in just about every system I've been writing that 
uses cred.  It seems that a very likely pattern is that every user has a 
numerical ID (RDBMS primary key, UNIX uid, storq storage ID, ZODB/cog 
oid) and you should not even use usernames as avatar IDs at all if you 
can avoid it.
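
A sketch of the numeric-ID pattern in present-day Python: the checker resolves the login name to a numeric ID and returns that ID, rendered as an ASCII byte string, as the avatar ID. `avatar_id_for` and `uid_table` are hypothetical names; rendering the number in decimal is an assumption, chosen only because it keeps the avatar ID a plain string.

```python
def avatar_id_for(username, uid_table):
    """Return the user's numeric ID as an ASCII byte string."""
    uid = uid_table[username]
    # Decimal digits are pure ASCII, so no encoding question arises.
    return str(uid).encode("ascii")
```

The username, whatever its script or encoding, never crosses the checker/realm boundary in this arrangement.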

comment:21 Changed 14 years ago by Moshe Zadka

<moshez> glyph: transporting SHIFT_JIS correctly across unicode is a fairly
         non-trivial task
<glyph> moshez: craptastic
<glyph> moshez: python's JIS encodings won't do it for you?
<moshez> that's why I chose SHIFT-JIS
<moshez> glyph: my understanding is that JIS->Unicode is a political issue
         rife with difficulties centering around the difference between lots
         of subtle concepts I've no idea about like the difference between a
         character and a code point
<glyph> moshez: well, my point is, there is *SOME* way to encode what you want
        as a string
<glyph> moshez: so the *convention* should be UTF-8
<glyph> if you can't do UTF-8, well, that sucks, but it's just a convention
         anyway
<glyph> moshez: okay, but are we in agreement?
<moshez> glyph: well, I still dislike recommending a work-around [use utf-8]
         without  a clear view of the implication
<glyph> moshez: I think we've demonstrated that we have a clear view of 90%
        of the implications
<moshez> glyph: I prefer "bug us with a use case"
<glyph> moshez: they won't
<moshez> glyph: so do you want to formulate a new note, and check it in?
<glyph> moshez: OK.  I'll add something to the documentation tonight.

comment:22 Changed 14 years ago by Moshe Zadka

Adding proposed formulation and marking patch.

Index: doc/howto/cred.xhtml
===================================================================
RCS file: /cvs/Twisted/doc/howto/cred.xhtml,v
retrieving revision 1.5
diff -u -r1.5 cred.xhtml
--- doc/howto/cred.xhtml        17 Oct 2003 04:46:19 -0000      1.5
+++ doc/howto/cred.xhtml        20 Oct 2003 16:42:39 -0000
@@ -128,6 +128,12 @@
 <p>This method will typically be called from 'Portal.login'.  The avatarId
 is the one returned by a CredentialChecker.</p>

+<div class="note">
+Note that <code>avatarId</code> must always be a string. In particular,
+do not use unicode strings. If internationalized support is needed,
+it is recommended to use UTF-8, and take care of decoding in the realm.
+</div>
+
 <p>The important thing to realize about this method is that if it is being
 called, <em>the user has already authenticated</em>.  Therefore, if possible,
 the Realm should create a new user if one does not already exist

comment:23 Changed 14 years ago by Moshe Zadka

Modified files:
Twisted/doc/howto/cred.xhtml 1.6 1.7

Log message:
document internationalization suckage

Fixed.

comment:24 Changed 6 years ago by <automation>

Owner: Glyph deleted