Thanks @tscroggins for the upvote and karma points, much appreciated! I was busy for the last few months and could not spend time on this one. May I know your suggestions on these points, please:

1) I have been thinking of creating an app, as in your suggestion listed below. Would you recommend an app, a custom command, or simply uploading a lookup of Unicode data for all the important languages (such as tamil_unicode_block.csv) to Splunk?

| makeresults
| eval _raw="இடும்பைக்கு"
| rex max_match=0 "(?<char>.)"
| lookup tamil_unicode_block.csv char output general_category
| eval length=mvcount(mvfilter(NOT match(general_category, "^M")))

2) I assume that if I wrap the Python script listed below in that app, it should work around this issue in a language-agnostic way (the app should work for Tamil, Hindi, Telugu, etc.).

3) Any other suggestions are welcome, thanks.

The app idea (your script from the previous reply):

$SPLUNK_HOME/etc/apps/TA-ucd/bin/ucd_category_lookup.py (this file should be readable and executable by the Splunk user, i.e. have at least mode 0500)

#!/usr/bin/env python
import csv
import unicodedata
import sys

def main():
    # Expect the two field names configured in external_cmd.
    if len(sys.argv) != 3:
        print("Usage: python category_lookup.py [char] [category]")
        sys.exit(1)

    charfield = sys.argv[1]
    categoryfield = sys.argv[2]

    infile = sys.stdin
    outfile = sys.stdout

    r = csv.DictReader(infile)
    header = r.fieldnames

    w = csv.DictWriter(outfile, fieldnames=r.fieldnames)
    w.writeheader()

    # Fill in the Unicode general category for each non-empty character.
    for result in r:
        if result[charfield]:
            result[categoryfield] = unicodedata.category(result[charfield])
        w.writerow(result)

main()

$SPLUNK_HOME/etc/apps/TA-ucd/default/transforms.conf

[ucd_category_lookup]
external_cmd = ucd_category_lookup.py char category
fields_list = char, category
python.version = python3

$SPLUNK_HOME/etc/apps/TA-ucd/metadata/default.meta

[]
access = read : [ * ], write : [ admin, power ]
export = system

With the app in place, we count 31 non-whitespace characters using the lookup:

| makeresults
| eval _raw="இடும்பைக்கு இடும்பை படுப்பர் இடும்பைக்கு
இடும்பை படாஅ தவர்"
| rex max_match=0 "(?<char>.)"
| lookup ucd_category_lookup char output category
| eval length=mvcount(mvfilter(NOT match(category, "^M")))

Since this doesn't depend on a language-specific lookup, it should work with text from the Kural or any other source with characters or glyphs represented by Unicode code points.

We can add any logic we'd like to an external lookup script, including counting characters of specific categories directly:

| makeresults
| eval _raw="இடும்பைக்கு இடும்பை படுப்பர் இடும்பைக்கு
இடும்பை படாஅ தவர்"
| lookup ucd_count_chars_lookup _raw output count

If you'd like to try this approach, I can help with the script, but you may enjoy exploring it yourself first.
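
As a starting point for the ucd_count_chars_lookup idea, here is a minimal sketch of what such an external lookup script might look like. It is not the script from the original reply: the file name ucd_count_chars_lookup.py, the helper names count_non_mark_chars and process, and the choice to skip whitespace along with combining marks (general categories starting with M) are all my assumptions.

```python
#!/usr/bin/env python
# Hypothetical bin/ucd_count_chars_lookup.py (name and layout are assumptions).
import csv
import sys
import unicodedata


def count_non_mark_chars(text):
    """Count code points that are neither whitespace nor combining marks."""
    return sum(
        1
        for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("M")
    )


def process(infile, outfile, textfield, countfield):
    """Read lookup rows as CSV, fill countfield from textfield, write CSV."""
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # Leave rows with an empty text field untouched.
        if row.get(textfield):
            row[countfield] = count_non_mark_chars(row[textfield])
        writer.writerow(row)


# Run only when both field names are supplied on the command line.
if __name__ == "__main__" and len(sys.argv) == 3:
    process(sys.stdin, sys.stdout, sys.argv[1], sys.argv[2])
```

To wire it up, I assume a transforms.conf stanza analogous to [ucd_category_lookup] would be needed, with the script taking _raw and count as its two arguments. On the sample text above, count_non_mark_chars returns 31, which matches the count from the category lookup.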