Being an old (Bell Labs) Unix guy, I love regex. Unfortunately, in this case the log4j data type doesn't accept the native log4j layout conversion pattern format as specified in the log4j properties files. Translating the log4j output into a regex is proving tedious, and parsing often fails with the error message "Failed to process 'regex statement' ex:GC overhead limit exceeded".
For example:
In the log4j.properties file the format description looks like this:
log4j.appender.dal.appender.layout.ConversionPattern=%d [%t] %-5p %m %n
A sample line from the log output contains a string that is a complex sql query with its results that would be nice to parse (e.g., separating the query from the results in separate columns). However the query contains a lot of characters that conflict with special regex characters. It would be nice to be able to use the log4j conversion pattern to do an initial parsing, then at the field level use regex to split the field further.
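To illustrate the idea of letting the conversion pattern drive the initial parse, here is a minimal Python sketch. It is not a Logscape feature; `CONVERSIONS` and `pattern_to_regex` are hypothetical names, only the conversion characters used above are covered, `%d` is assumed to use log4j's default ISO8601 layout, and options such as `%d{...}` are not handled.

```python
import re

# Assumed regex fragment for each conversion character (illustrative, not exhaustive).
CONVERSIONS = {
    "d": r"(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})",  # default ISO8601 layout
    "t": r"(?P<thread>.+?)",
    "p": r"(?P<level>[A-Z]+)",
    "m": r"(?P<message>.*?)",
    "n": r"",  # line separator; each line is matched on its own
}

# A conversion specifier: '%', an optional format modifier such as -5 or 20.30, then a letter.
SPEC = re.compile(r"%(-?\d+)?(?:\.\d+)?([a-zA-Z])")


def literal_to_regex(literal):
    # Escape literal text; any run of whitespace matches one or more whitespace characters,
    # which also absorbs the padding added by modifiers like %-5p.
    return r"\s+".join(re.escape(chunk) for chunk in re.split(r"\s+", literal))


def pattern_to_regex(conversion_pattern):
    parts = []
    pos = 0
    for spec in SPEC.finditer(conversion_pattern):
        parts.append(literal_to_regex(conversion_pattern[pos:spec.start()]))
        parts.append(CONVERSIONS[spec.group(2)])  # KeyError for unsupported characters
        pos = spec.end()
    parts.append(literal_to_regex(conversion_pattern[pos:]))
    return re.compile("^" + "".join(parts) + "$")


line_regex = pattern_to_regex("%d [%t] %-5p %m %n")
match = line_regex.match("2014-09-28 20:07:25,773 [main] INFO  Starting up \n")
print(match.groupdict() if match else "no match")
```

The message group could then be split further with ordinary regex at the field level, which keeps the SQL's special characters out of the line-level pattern.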
Hi Thom,
Could you post a couple of lines of what you want to extract? For pulling out the other (SQL) fields, you may want to consider using synthetics, which let you apply a pattern extraction where data doesn't occur on every line.
Ok, I'd like to start with something a bit more straightforward. Consider the output from wrapper.log. In the application's wrapper.conf file there are some pre-defined output pattern token definitions you can specify:
'L' for log level,
'P' for prefix,
'D' (Since ver. 3.1.0) for thread,
'T' for time,
'Z' for millisecond time,
'R' for duration in milliseconds, showing the time since the previous JVM output,
'U' for approximate uptime in seconds,
'M' for message.
Some common variations are as follows:
Output Example with the property "wrapper.console.format=LPZM":
STATUS | wrapper | 2001/12/11 13:45:33.560 | --> Wrapper Started as Console
STATUS | wrapper | 2001/12/11 13:45:33.560 | Launching a JVM...
INFO | jvm 1 | 2001/12/11 13:45:35.575 | Initializing...
Output Example with the property "wrapper.logfile.format=LPTM":
STATUS | wrapper | 2001/12/11 13:45:33 | --> Wrapper Started as Console
STATUS | wrapper | 2001/12/11 13:45:33 | Launching a JVM...
INFO | jvm 1 | 2001/12/11 13:45:35 | Initializing...
Output Example with the property "wrapper.logfile.format=PTM":
wrapper | 2001/12/11 13:45:33 | --> Wrapper Started as Console
wrapper | 2001/12/11 13:45:33 | Launching a JVM...
jvm 1 | 2001/12/11 13:45:35 | Initializing...
Output Example with the property "wrapper.logfile.format=PM":
wrapper | --> Wrapper Started as Console
wrapper | Launching a JVM...
jvm 1 | Initializing...
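As a rough illustration of how the format string determines the columns, here is a minimal Python sketch that splits a wrapper log line according to a format such as LPZM. The field names in `TOKEN_NAMES` and the function `parse_wrapper_line` are hypothetical, and the sketch assumes 'M' is the last token so that any pipe inside the message is left alone.

```python
# Hypothetical field names for the wrapper format tokens listed above.
TOKEN_NAMES = {
    "L": "level", "P": "prefix", "D": "thread", "T": "time",
    "Z": "time_ms", "R": "duration_ms", "U": "uptime", "M": "message",
}


def parse_wrapper_line(line, fmt):
    """Split one wrapper log line into named fields according to the format string."""
    names = [TOKEN_NAMES[token] for token in fmt]
    # Split on at most len(names) - 1 pipes so a trailing message may contain '|'.
    values = line.split("|", len(names) - 1)
    if len(values) != len(names):
        raise ValueError("line does not match format %r: %r" % (fmt, line))
    return {name: value.strip() for name, value in zip(names, values)}


print(parse_wrapper_line(
    "STATUS | wrapper | 2001/12/11 13:45:33.560 | --> Wrapper Started as Console", "LPZM"))
print(parse_wrapper_line("jvm 1 | Initializing...", "PM"))
```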
Ok, so I played around with synthetics and was able to create a parser for "wrapper.console.format=LPZM". It would be great if the pattern editor had an "undo" or some sort of versioning capability. I need to read up on synthetics coding. I'm hoping that it supports some fairly complex string processing logic.
So you don't have to escape the pipe symbol with a backslash? And are you saying the maximum number of tokens you can return is 3?
Going back to my original question with the SQL statement in the message, the log lines look like this:
2014-09-28 20:07:25,773 [UCMDB - scheduler for customer 1, id name: Default Client] INFO 0ms N/A SELECT SYSDATE FROM DUAL
2014-09-28 20:07:49,008 [802807669@qtp0-2241] INFO 0ms 1144276771 46341656 1960505241 1369989566 775921648 1394433352 SELECT DISCOVERYRESOURCE_0.CMDB_ID FROM DISCOVERYRESOURCE_0 DISCOVERYRESOURCE_0 WHERE 1=1 AND DISCOVERYRESOURCE_0.CUSTOMER_ID=? AND (DISCOVERYRESOURCE_0.A_SUBSYSTEM = ? AND DISCOVERYRESOURCE_0.A_NAME = ? ) ; Values: 1,'discoveryConfigFiles','securitySettingsDocument'
I'm using a pattern that looks like this: ^(2*)\s+(*)\s(\[.*\])\s+(INFO|DEBUG|WARN|ERROR|FATAL)\s+(*ms)\s(**)
I want the parsing to produce these output fields: date, time, thread, level, exTime, msg, query, queryresults
I'm having the following problems:
With the msg field: log line 1 should display "N/A"; log line 2 should contain "1144276771 46341656 1960505241 1369989566 775921648 1394433352"
For the query and queryresult columns, log line 1 should display: SELECT SYSDATE FROM DUAL in the "query" column and nothing in the "queryresult" column
Log line 2 should display "SELECT DISCOVERYRESOURCE_0.CMDB_ID FROM DISCOVERYRESOURCE_0 DISCOVERYRESOURCE_0 WHERE 1=1 AND DISCOVERYRESOURCE_0.CUSTOMER_ID=? AND (DISCOVERYRESOURCE_0.A_SUBSYSTEM = ? AND DISCOVERYRESOURCE_0.A_NAME = ? )" in the "query" column and "Values: 1,'discoveryConfigFiles','securitySettingsDocument'" in the "queryresult" column
I've tried a few variations and one observation is that it doesn't like to split at the ';' (if it is there).
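For reference, the split described above can be sketched with a standard regex in Python. This only illustrates the target output and is not Logscape pattern syntax; `LINE`, `parse_line`, and the sample variables are hypothetical names, and the sketch assumes every query starts with SELECT and that results, when present, follow a ';' as "Values: ...".

```python
import re

LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2},\d{3})\s+"
    r"\[(?P<thread>.*?)\]\s+"
    r"(?P<level>INFO|DEBUG|WARN|ERROR|FATAL)\s+"
    r"(?P<exTime>\d+ms)\s+"
    r"(?P<msg>.*?)\s+"                           # everything before the query
    r"(?P<query>SELECT\b.*?)"                    # assumption: queries start with SELECT
    r"(?:\s*;\s*(?P<queryresults>Values:.*))?$"  # optional '; Values: ...' tail
)

LINE1 = ("2014-09-28 20:07:25,773 [UCMDB - scheduler for customer 1, "
         "id name: Default Client] INFO 0ms N/A SELECT SYSDATE FROM DUAL")
LINE2 = (
    "2014-09-28 20:07:49,008 [802807669@qtp0-2241] INFO 0ms "
    "1144276771 46341656 1960505241 1369989566 775921648 1394433352 "
    "SELECT DISCOVERYRESOURCE_0.CMDB_ID FROM DISCOVERYRESOURCE_0 DISCOVERYRESOURCE_0 "
    "WHERE 1=1 AND DISCOVERYRESOURCE_0.CUSTOMER_ID=? AND "
    "(DISCOVERYRESOURCE_0.A_SUBSYSTEM = ? AND DISCOVERYRESOURCE_0.A_NAME = ? ) ; "
    "Values: 1,'discoveryConfigFiles','securitySettingsDocument'"
)


def parse_line(line):
    """Return a dict of the eight fields, or None if the line does not match."""
    match = LINE.match(line)
    return match.groupdict() if match else None


for sample in (LINE1, LINE2):
    fields = parse_line(sample)
    print(fields["msg"], "|", fields["query"], "|", fields["queryresults"])
```

The query group is lazy and the "; Values:" tail is optional, so a line without results leaves queryresults empty instead of failing to match.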
I can see you already have a good understanding of how the data types page works. My answer is slightly verbose for the sake of clarity so I may be repeating a few things you already know.
When creating a datatype you have two types of fields: the regular fields and the synthetics. The regular fields are mapped sequentially from the groups in your regular expression. The regular fields represent the structure of every log line in your file. Taking your example, the structure of your log file is DATE, TIME, THREAD, LEVEL, exTIME and MSG. A correctly formatted log entry will always have this structure. The regex pattern is used to capture this structure of your log file.
The synthetic fields are there to extract data that may not occur on every line, the variable part of your logs. The synthetic fields do not have to occur on every line, whereas the regular fields must match each line for it to be processed. For example, the MSG field will always contain text, but the SQLRESULT field will only contain a value if it finds/matches the sql results in the MSG field. If you take a look at the image I have attached, I've highlighted the synthetic fields and the regular fields.
1.) Create the sqlResult synth field to extract the text "1144276771 46341656 1960505241 1369989566 775921648 1394433352 SELECT DISCOVERYRESOURCE_0" from the MSG field
Name: sqlResult
synth source: MSG
synth expression: (**)SELECT DISCOVERYRESOURCE
The sqlResult field will be populated with a value after running the synth expression on the value held in the field referenced in the synth source. The first group matched is always used as the value.
2.) Create the synths to extract the values held in sqlResult
With your particular data it was necessary to create the sqlResult synth field before attempting to extract the numerical values.
Name: val1
synth source: sqlResult
synth expression: (\d+)
val1 will apply the pattern (\d+) on the text "1144276771 46341656 1960505241 1369989566 775921648 1394433352 SELECT DISCOVERYRESOURCE_0" to extract 1144276771
Name: val2
synth source: sqlResult
synth expression: \d+ (\d+)
val2 will match the second number from sqlResult, and so on with the other values.
The text diagram below crudely shows how val1, val2 and val3 each obtain their values from a raw log line.
RAW LOG DATA -> MSG -> sqlResult |-> val1
                                 |-> val2
                                 |-> val3
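The chain in the diagram can be imitated outside Logscape with a short Python sketch. This is only an approximation of the behaviour described above, not Logscape's implementation: `synth` is a hypothetical helper, and the Logscape wildcard (**) is replaced by the standard regex (.*) as an assumption.

```python
import re


def synth(source_value, expression):
    """Apply a regex to a source field and return the first group, or None if no match."""
    if source_value is None:
        return None
    match = re.search(expression, source_value)
    return match.group(1) if match else None


# MSG as captured by the regular fields (query shortened here for readability).
msg = ("1144276771 46341656 1960505241 1369989566 775921648 1394433352 "
       "SELECT DISCOVERYRESOURCE_0.CMDB_ID FROM DISCOVERYRESOURCE_0")

sql_result = synth(msg, r"(.*)SELECT DISCOVERYRESOURCE")  # MSG -> sqlResult
val1 = synth(sql_result, r"(\d+)")                        # sqlResult -> val1
val2 = synth(sql_result, r"\d+ (\d+)")                    # sqlResult -> val2
val3 = synth(sql_result, r"\d+ \d+ (\d+)")                # sqlResult -> val3
print(val1, val2, val3)

# A line without the matching query leaves the synth empty.
print(synth("N/A SELECT SYSDATE FROM DUAL", r"(.*)SELECT DISCOVERYRESOURCE"))
```

Because each synth only yields a value when its expression matches, lines such as the SELECT SYSDATE entry still satisfy the regular fields while the synth fields stay empty.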
3.) Synth Expressions
Synth expressions do not always need a synth source. A synth expression can perform numerical calculations on fields, or it can be a regex pattern, a Groovy script, or any of the Logscape text functions such as split and substring. Here's an example synth expression that does a number conversion.
Name: val1K
synth expression: jep: val1 / 1024
4.) Uploading the example datatype.
You can see how I have put all of this together in the data type skazal, which I will email to you. Navigate to the Configure/Backup page and click OVERWRITE to upload the skazal.config file. Once you have done that, navigate to the Configure/Datatypes page and click open to search for the skazal datatype.