Opened 3 years ago

Closed 3 years ago

Last modified 12 months ago

#7697 enhancement closed fixed (fixed)

JSON logging output should conform to draft-ietf-json-text-sequence

Reported by: Wilfredo Sánchez Vega
Owned by:
Priority: high
Milestone: New Logging System
Component: logger
Keywords: logging
Cc: Glyph
Branch: branches/log-json-7697
Author:

Description

See: https://tools.ietf.org/html/draft-ietf-json-text-sequence

Basically what this spec says is that one should emit JSON to a text file in this form:

RS + JSON + LF

a.k.a.:

U+001E + JSON + U+000A

Which enables one to detect corrupt (e.g. partial) entries and recover, which seems like a useful idea.
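For illustration, a minimal emitter following this framing might look like the sketch below (the name `writeRecord` is hypothetical, not Twisted's API):

```python
import json

RS = "\x1e"  # U+001E RECORD SEPARATOR
LF = "\n"    # U+000A LINE FEED


def writeRecord(outFile, event):
    """Write one record using RS + JSON + LF framing."""
    outFile.write(RS + json.dumps(event) + LF)
```

A reader can then re-split the stream on RS and treat a missing trailing LF as a sign of truncation.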

Change History (46)

comment:1 Changed 3 years ago by Wilfredo Sánchez Vega

Milestone: Twisted 14.1.0
Priority: normal → high

I'd say it's important that we fix the output format now. Dealing with corrupt entries properly can be a follow-on ticket if necessary, but we should get the format in place now so we don't have to change it later.

Tagging for 14.1 so we don't have to have some compatibility hooey.

comment:2 Changed 3 years ago by Glyph

This seems to run counter to the popular practice of simply delimiting JSON lines with CR/LF. How does using RS instead of CR help detect partial entries?

comment:3 Changed 3 years ago by Glyph

(Not saying we shouldn't implement it, compliance with standards is good, but it doesn't seem like the sort of thing most people are going to want to ship in their log formats, since e.g. Logstash already has support for the newline format and not, as far as I know, the RS format.)

comment:4 Changed 3 years ago by Wilfredo Sánchez Vega

I think the reason for the leading RS is that if your process writes a partial entry out, in the terminating CR/LF format you will lose the delimiter entirely, which corrupts the entry you are writing *and* the following one. In the proposed format, if a parser sees RS twice without an intervening LF, it can detect the truncation at the second RS and still delimit the start of the next entry.

comment:5 Changed 3 years ago by Wilfredo Sánchez Vega

Basically, I think the parsing logic is:

  • split on RS
  • If last character is LF, parse up to LF as JSON
  • otherwise treat as truncated line (toss it, or attempt some recovery)
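Those steps can be sketched roughly as follows (a hypothetical helper, not the branch's actual code):

```python
import json

RS = "\x1e"


def parseSequence(text):
    """Yield events from a JSON text sequence, skipping bad records."""
    for record in text.split(RS):
        if not record:
            continue  # empty fragment before the first RS
        if not record.endswith("\n"):
            continue  # no terminating LF: truncated record
        try:
            yield json.loads(record)
        except ValueError:
            continue  # framed, but the JSON itself is corrupt
```

Note how a truncated middle record costs only that one entry; the following record is still recoverable at the next RS.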

comment:6 Changed 3 years ago by hawkowl

Milestone: Twisted 14.1.0 → New Logging System

comment:7 Changed 3 years ago by Glyph

It's worth noting that json.loads will happily skip spaces and newlines, but will not skip a record separator, instead giving an ugly "No JSON object could be decoded", making me further think that this is of dubious value... it also seems that logstash can't ingest this format without tweaking, either (although configuring logstash is a bit of a black art so I might just be missing something). This increasingly convinces me that the newline format should really be the default and this should be a separate add-on later.
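For example (the quoted message is Python 2's; Python 3 raises a JSONDecodeError with different wording, but the behaviour is the same):

```python
import json

text = "\x1e" + '{"message": "hello"}' + "\n"

try:
    json.loads(text)  # the leading RS is not JSON whitespace
except ValueError as e:
    print("decode failed:", e)

# Stripping the RS first works; a trailing newline is tolerated.
event = json.loads(text.lstrip("\x1e"))
```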

comment:8 Changed 3 years ago by Wilfredo Sánchez Vega

That json.loads doesn't skip over RS doesn't seem terribly relevant as to whether this is of any value or not.

RS is used as a record separator specifically because it is not allowed in JSON data without being encoded, therefore it's a reliable separator for entries in a JSON stream.

The value it adds is that it makes it possible to detect a truncated JSON text sequence, which can happen for a number of reasons when writing to a file. You can then properly decode the next sequence, losing only one, instead of two. Filtering out the RS characters is rather trivial if you can't write the code to read this back out, but you can't add them back in after the fact.

This draft is in the final stages of review and will be an RFC soon, so I'd be surprised if the logstash developers decide that it's not worth supporting at all, especially since it actually does something useful... But again, it's pretty trivial to filter the RS characters out.

FWIW, I filed: https://github.com/elasticsearch/logstash/issues/2381

comment:9 Changed 3 years ago by Glyph

If you can convince the logstash guys then this might be a reasonable thing to do by default - but as it stands, all the log aggregators I know of will only work with newline-separated-but-no-newlines-internally JSON logs. Older versions of logstash don't let you customize the separator so I think you will have to write some regexes, and the fact that you can't type a RS into a web form might be an impediment.

Basically I think this is the right call technically (see my comment on the logstash issue about pretty-printing) but I don't want to give administrators a bad experience by default just so we can say we support a spec.

comment:10 Changed 3 years ago by Wilfredo Sánchez Vega

Yeah, I was prepared to blow off the spec (Cyrus put it in front of me, naturally), but I think it's actually useful.

I'm thinking we can make the RS optional in the output and handle either when reading, so that in twistd you can have a choice, if we want that as an option.

comment:11 in reply to:  10 ; Changed 3 years ago by Glyph

Replying to wsanchez:

I'm thinking we can make the RS optional in the output and handle either when reading, so that in twistd you can have a choice, if we want that as an option.

Sounds like we're on the same page, then :-).

I guess on the input side we would decide whether we expect it or not based on the first octet?

comment:12 in reply to:  11 Changed 3 years ago by Wilfredo Sánchez Vega

Replying to glyph:

I guess on the input side we would decide whether we expect it or not based on the first octet?

Yeah, that sounds like the way to go.
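A rough sketch of that autodetection (a hypothetical helper; assumes a seekable file that vends either bytes or text):

```python
def detectRecordSeparator(fileHandle):
    """Guess the record separator from the stream's first octet."""
    first = fileHandle.read(1)
    fileHandle.seek(0)  # rewind so the caller sees the full stream
    if first in (b"\x1e", u"\x1e"):
        return first  # RS-delimited JSON text sequence
    # Plain newline-delimited stream: no record separator.
    return b"" if isinstance(first, bytes) else u""
```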

comment:13 Changed 3 years ago by Wilfredo Sánchez Vega

Owner: set to Wilfredo Sánchez Vega
Status: new → assigned

comment:14 Changed 3 years ago by Wilfredo Sánchez Vega

Branch: branches/log-json-7697
Keywords: logging added

comment:15 Changed 3 years ago by Wilfredo Sánchez Vega

Keywords: review added
Owner: Wilfredo Sánchez Vega deleted
Status: assigned → new

Builders kicked.

comment:16 Changed 3 years ago by Adi Roiban

I guess that this patch is not based on trunk... as in the GitHub diff I get 7,893 additions and 535 deletions.

I assume that the diff for review is https://github.com/twisted/twisted/compare/81f1719...log-json-7697

Please update the description of the review request with information about the branch into which this ticket will be merged, or which diff needs review.

Thanks!

comment:17 Changed 3 years ago by Wilfredo Sánchez Vega

Sorry; all bugs on the "New Logging System" milestone are branched from source:branches/log-booyah-6750-5.

The diff for review here is:

https://twistedmatrix.com/trac/changeset?reponame=&new=43777%40branches%2Flog-json-7697&old=43688%40branches%2Flog-booyah-6750-5

which is discoverable by looking at the revision log for source:branches/log-json-7697 in Trac, and viewing the changes.

comment:18 Changed 3 years ago by Adi Roiban

Thanks for the work on this. General code looks good.

here are my initial comments

Don't know what to say about

text = bytes(record).decode("utf-8")

and

buffer += newData.encode("utf-8")

and how it behaves when the chunk read contains only a fragment of a UTF-8 encoded character... I guess that you will end up with a decoding error.

Besides splitting chunks by record separator, I think that we should also check that the chunk contains valid UTF-8 encoded characters and accumulate a partial character with the next chunk.


Minor comment:

Regarding rs, I prefer to have the public argument name as recordSeparator. It is much easier to read and will reduce the need to go read the docs... if we have rs, then we should also have 'of' instead of 'outFile'.

Not sure what is the Twisted naming policy, but I prefer code which is easier to read.


# Default to simpler newline-separated stream. 
rs = b"" 

This comment is hard to read, as I see an empty string instead of a newline... Maybe rephrase it to: set to the empty string, which will be expanded later as the newline character.

Why not set rs = b'\n' and then check if rs == b'\n':?

I think that it is better to write code which speaks for itself and doesn't require extra comments.

Maybe I am missing something here


For if rs == b"": maybe you can write if not rs: instead; I find it more idiomatic.


Minor comment, regarding this code from tests.

        finally:
            fileHandle.close()

To make the tests easier to read, one option is to define fileHandle as self.fileHandle and then close it in tearDown, when defined... or use self.addCleanup(self.fileHandle.close) to reduce the level of indentation.


Another minor comment

Instead of

self.assertRaises(StopIteration, next, events)  # No more events

maybe you can extract this into a domain specific language and have

self.assertNoMoreEvents(events)

def assertNoMoreEvents(self, events):
   self.assertRaises(StopIteration, next, events)

this should reduce the need to comment each line, as the method name replaces the comment.

I try to avoid writing comments, preferring to let the code speak for itself.


L{eventsFromJSONLogFile} reads events from a file and automatically detects use of C{u"\x1e"} as the record separator.

self._readEvents(fileHandle)

I find it hard to read this test.

Maybe use self._readEvents(fileHandle, rs=None)


Another comment: Do we use unicode string marker in docstrings? C{u"\x1e"} vs C{"\x1e"}


This is not a full review but I hope it is helpful.

comment:19 Changed 3 years ago by Adi Roiban

Here is an example

ca = 'pr\xc8\x9ba'

# All good.
ca.decode('utf-8')
# chunk decode failure (the trailing multi-byte character is truncated)
ca[:-2].decode('utf-8')

The problem is described here: https://docs.python.org/2/howto/unicode.html#reading-and-writing-unicode-data

One fix, using from codecs import iterdecode, is discussed here:

http://blog.etianen.com/blog/2013/10/05/python-unicode-streams/
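For instance, the codecs module's incremental decoders buffer a partial multi-byte character across chunk boundaries (a sketch; `decodedChunks` is a hypothetical helper, not the code the branch ended up with):

```python
import codecs


def decodedChunks(chunks, encoding="utf-8"):
    """Decode an iterable of byte chunks, buffering partial characters."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```

A chunk boundary falling in the middle of a two-byte character is then handled transparently instead of raising UnicodeDecodeError.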

Hope this helps

comment:20 Changed 3 years ago by Wilfredo Sánchez Vega

adiroiban: awesome feedback, thanks.

comment:21 Changed 3 years ago by Adi Roiban

Keywords: review removed
Owner: set to Wilfredo Sánchez Vega

I think that the unicode chunk read problem is serious enough and requires a rewrite of the patch, so there is no point in keeping it in review.

Also, I don't know how to proceed with new logging system tickets. Do they require a news file?

Thanks!

comment:22 Changed 3 years ago by Wilfredo Sánchez Vega

Status: new → assigned

Yeah, not in review right now, I'll re-accept the bug.

Merging to trunk will require a news file. I'll write that when all the dependent bugs are closed and we're ready to merge to trunk.

Last edited 3 years ago by Wilfredo Sánchez Vega

comment:23 Changed 3 years ago by Wilfredo Sánchez Vega

In writing a test for the truncated unicode, I had to feed in a file that vends UTF-8 bytes instead of unicodes, and it turns out the code didn't handle byte streams correctly. r43813 adds tests for both stream types, and we see that byte streams are busted.

r43814 fixes the code.

comment:24 Changed 3 years ago by Wilfredo Sánchez Vega

r43815 changes rs to recordSeparator. Pardon the momentary lapse of sanity.

comment:25 Changed 3 years ago by Wilfredo Sánchez Vega

Regarding recordSeparator = b"" thing, I changed the comment to explain that the newline-only format doesn't use a record separator. (r43816)

I can't change recordSeparator == b"" later in the code, because a user passing in a recordSeparator argument for a format where the separator is not used (which is what this case is catching) should specify a value of u"", presumably matching what was passed to jsonFileLogObserver when the file was created.

I then set recordSeparator = b"\n" in that case because the implementation that follows happens to be one in which that works, not because that's actually the record separator expected.

comment:26 Changed 3 years ago by Wilfredo Sánchez Vega

if not recordSeparator: may be more idiomatic, but I find it to be sloppy.

If recordSeparator happens to be None (or some other false value that isn't bytes) here, that's not a match for what I'm looking for, but a bug in the code above.

Such a bug would be masked in the idiomatic code, but would blow up in bytearray.split() later as-is.

Probably not a likely issue here, but I prefer to be as specific as possible for what I'm looking for when I can be, especially when I'm dealing with mixed-type text.

comment:27 Changed 3 years ago by Wilfredo Sánchez Vega

I think you are referring to test_readEventsAutoNoRS in the next comment. I prefer leaving the argument out to indicate that I'm testing the default value; I'm not testing for None specifically.

Finally, as to whether we use unicode string marker in docstrings, in the logging module I use it religiously. It adds important clarity to what sort of string you are intending to use, and this is especially important in light of differences between Python 2 and Python 3.

Being explicit has saved me a lot of grief. Enough so that I wish it were possible to make pyflakes complain about all uses of unqualified strings. Sadly, it's also important to be able to use native strings at times, and the Python authors didn't think that providing a way to be explicit about that (e.g. n"foo") was important, so there's no way to distinguish between native-because-I-want-native and native-because-that's-normal.

comment:28 in reply to:  26 Changed 3 years ago by Glyph

Replying to wsanchez:

if not recordSeparator: may be more idiomatic, but I find it to be sloppy.

...

Probably not a likely issue here, but I prefer to be as specific as possible for what I'm looking for when I can be, especially when I'm dealing with mixed-type text.

I completely agree. Always compare as narrowly as possible; if not rs implies "it's okay if this is an empty tuple, or an empty list, or None, or whatever". The common idiom is an antipattern :).

comment:29 Changed 3 years ago by Wilfredo Sánchez Vega

r43819 adds a bufferSize argument to eventsFromJSONLogFile.

r43820 adds two tests:

  • test_readEventsPartialBuffer ensures that partially buffered events are handled correctly.
  • test_readTruncatedUnicodeBytes adds a test where we buffer to the middle of a UTF-8 code point.

The latter test failed when I first wrote it but passes now because the new asBytes nested function doesn't try to encode bytes, so a file stream that vends bytes won't try to decode the text. We also don't attempt to encode the received bytes until we have a full record, so that works.

However, I think we need a test for a record with invalid UTF-8 in it. r43821 adds that test and r43822 fixes the code to pass.

comment:30 Changed 3 years ago by Wilfredo Sánchez Vega

I think that addresses everything… kicking the builders.

comment:31 Changed 3 years ago by Wilfredo Sánchez Vega

Keywords: review added

Back up for review.

comment:32 Changed 3 years ago by Wilfredo Sánchez Vega

Owner: Wilfredo Sánchez Vega deleted
Status: assigned → new

comment:33 Changed 3 years ago by Adi Roiban

Thanks for the changes.

Not sure if it is OK for the code to silently ignore invalid records... either due to a missing separator or bad JSON serialization... I guess that if someone needs this, they can extend the current code.

The draft states:

Sequence parsers should have an option to warn about truncated JSON texts.

so I guess that this option can be implemented later :) ... but maybe update the docstring to document this behaviour.


minor comment for this

if record and record[-1] == 10:  # 10 is b"\n"

maybe you can use ord('\n') or set a private constant _NEW_LINE_ORD = 10

When I see a comment I always try to change the code so that the comment is no longer needed.


Rather than working internally with Unicode, the current code works with UTF-8 bytes. In this way it can handle chunked UTF-8 bytes in a simple way.

Was it that bad/ugly to use iterdecode method?

It goes against my practice of handling Unicode data, ie convert everything to Unicode, work internally with Unicode and encode/decode at system input/output points... and on Python 3 this is counter productive, as read data is already Unicode... so Unicode chunk handling is only needed on Python 2.

I guess that in case someone needs a different encoding, it can be implemented later, but I think that the docstring should inform that only utf-8 is supported.

I am ok with this code as it works and is tested so I am not asking to rewrite it. I was just curios why you went with this implementation.


I am not very happy with the general structure of eventsFromJSONLogFile function as it is long and contains definitions for another 2 functions, one of which is created in a dynamic way.

Why not have a single eventFromRecord... something like this

def _eventFromRecord(record, recordSeparator):

    if recordSeparator != b"" and record[-1] != 10:
        log.error(
            u"Unable to read truncated JSON record: {record!r}",
            record=bytes(record)
        )
        return None

    try:
        text = bytes(record).decode("utf-8")
    except UnicodeDecodeError:
        log.error(
            u"Unable to decode UTF-8 for JSON record: {record!r}",
            record=bytes(record)
        )
        return None

    try:
        return eventFromJSON(text)
    except ValueError:
        log.error(
            u"Unable to read JSON record: {record!r}", record=text
        )
        return None

Do you need to explicitly try/finally close StringIO files? From my point of view, it just adds unimportant details to the test.

---

I am leaving this in review, as I am just a junior reviewer and I think that this branch is ready to be reviewed by a senior reviewer :)

Thanks!

comment:34 Changed 3 years ago by Wilfredo Sánchez Vega

This code does not silently ignore bad records; it emits a log message warning about truncated messages, which is exactly what is suggested. Further, it's configurable, because New Logging System. The caller can even register for these specific warnings and handle them somehow as the parsing is done. But your comment reminds me that log messages are functional, and I see that one case records the busted record as unicode where the others use bytes, and that should be consistent. Fixed in r43832, along with some minor test cleanup (meant to be separate commits, sorry). r43833 adds verification to all reader tests to make sure that the expected logging is happening.

comment:35 Changed 3 years ago by Wilfredo Sánchez Vega

I hadn't thought of ord("\n"). Like. r43834.

comment:36 Changed 3 years ago by Wilfredo Sánchez Vega

The docstring does say we expect UTF-8. I went this route because I always assume that data from external sources (network/files/etc.) comes in bytes. Perhaps a bigger factor is that I'm not all that experienced with working with Python's Unicode constructs and didn't know about iterdecode. That's probably worth exploring, but perhaps that's best done in a separate ticket. I filed #7769 to track that work.

comment:37 Changed 3 years ago by Wilfredo Sánchez Vega

The try/finally thing is consistent with the rest of the module. I do plan to clean this up, but for now sticking with the same pattern. Noted this in #7712 (thought it was there already).

comment:38 Changed 3 years ago by Adi Roiban

Keywords: logging review removed
Owner: set to Wilfredo Sánchez Vega

I find it strange to use a logging system for error handling/recovery... I see it as an error reporting system only. Feel free to ignore this; it was just my WTF moment :)


I agree that proper Unicode handling should be done in a separate ticket.

I see some test docstrings talking about utf-8, but not the public docstrings for jsonFileLogObserver or eventsFromJSONLogFile


Thanks for the reference to #7712 ... since you are going to clean that up anyway, I'm not sure it is productive to write bad code now, just to be cleaned up in a future ticket. I find it more productive to write the tests well now so that you don't have to clean them up in a future ticket.


For me FileLogObserverTests is still hard to read.

First, it has the private _observeWrites method. Why private? We are in test code zone; nobody should ever use this module/package in production. There is no need to deprecate things here.

Not sure what is the Twisted convention for this, but I don't see the point of private test methods.

Then, _observeWrites as a helper method does a test/assertion. Maybe it should be renamed to assertWritesJSONEvent... or something similar.

Also, _observeWrites is not about generic observers, but about a specific JSON observer... so maybe it should be renamed to _observeWritesJSON... or something similar :)

It still has the RS acronym in test names

Then, besides a test for recordSeparator, there are no tests for Unicode/UTF-8 input handling.

The docstrings for test_observeWritesRS and test_observeWritesNoRS have a lot of duplication, and the important detail is presented at the end.


Thanks!

Please take another look at my comments

PS: this ticket has the unofficial "logging" tag. I have removed it. Please make it official and put it back :)... I know that it makes sense for you, but I saw exarkun removing non-official tags from other tickets. So I am removing it just to irritate and start a discussion about this, if really required. Sorry for being a PITA! ... there is already a milestone for that.

comment:39 Changed 3 years ago by Wilfredo Sánchez Vega

Keywords: logging added
Status: new → assigned

comment:40 Changed 3 years ago by Wilfredo Sánchez Vega

Keywords: review added

Part of why this is interesting is that observers can inspect *objects* (vs. strings) and do things. But sure, it's weird. :-)

I chose for the time being to be consistent about the try/finally thing. Either way, it'll get fixed.

r43869 cleans up some names, though this is feeling a bit cosmetic.

I suppose you could argue that private methods in test cases aren't useful because one would never use a test class elsewhere, until someone uses a test class elsewhere. I prefer to avoid such assumptions.

Yeah, those docstrings are similar. The tests are similar. I don't think it's hard for someone reading these docstrings to read to the end, by which we mean the second sentence. I feel like that's OK.

comment:41 Changed 3 years ago by Tom Prince

Owner: Wilfredo Sánchez Vega deleted
Status: assigned → new

comment:42 Changed 3 years ago by Adi Roiban

Sorry for the delay in reviewing this... I am happy with the changes, but as a junior reviewer I was expecting exarkun, glyph, or another core developer to have the final word.

From the current conversation I understand that having a custom record separator is great, but I am not sure if 0x1E is the right default value.

Rather than block this patch, I am +1 on merging it and if somebody is not happy with the default value a new ticket can be raised.

Cheers!

comment:43 Changed 3 years ago by Wilfredo Sánchez Vega

Resolution: fixed
Status: new → closed

(In [44073]) Merging branches/log-json-7697

Author: wsanchez Reviewer: adiroiban Fixes: #7697

comment:44 Changed 3 years ago by Wilfredo Sánchez Vega

Thanks for the review adiroiban.

comment:45 Changed 15 months ago by hawkowl

Keywords: review removed

[mass edit] Removing review from closed tickets.

comment:46 Changed 12 months ago by Wilfredo Sánchez Vega

Component: core → logger