
Fuzzy Match Between Two LARGE Lookups

genesiusj
Builder

Hello,

We have a lookup CSV file with 1 million records (data1) and a KV store with 3 million records (data2). We need to compare a street address in data2 against a fuzzy match of the street address in data1 (see the example below), returning the property owner.
Ex:
data2 street address:    123 main street 
data1 street address:    123 main street apt 13

We ran a regular lookup command and it took well over 7 hours. We also tried creating a sub-address lookup (data1a) by removing the apt/unit numbers, but it was still a 7-hour search. Plus, if there is more than one apt/unit at an address, there might be more than one property owner. This is why a fuzzy-type compare is what we are looking for.
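The sub-address step was roughly along these lines (a sketch; the addr and sub_addr field names are illustrative):

| inputlookup data1.csv
``` strip trailing apt/unit/suite designators to build the sub-address key ```
| eval sub_addr=trim(replace(lower(addr), "\s+(apt|unit|ste|suite|#)\s*\S*$", ""))
| outputlookup data1a.csv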

Hope my explanation is clear. Ask if not.
Thanks and God bless,
Genesius (Merry Christmas and Happy Holidays)


PickleRick
SplunkTrust

You might double-check this, but if I remember correctly, CSV lookups do a linear search through their contents, so in the pessimistic case you'll be doing a million comparisons per input row only to return a negative match. It has nothing to do with fuzziness.


genesiusj
Builder

Thanks, @PickleRick 

I understand the linear-search nature of lookups. I was hoping there were perhaps some new commands on the horizon in the most recent (or future) versions of Splunk Enterprise, or that someone might have experience with MLTK, or another Splunk product, to handle this use case.

Thanks and God bless,
Genesius


bowesmana
SplunkTrust

When you say fuzzy, do you mean it should match based on similarity using something like Levenshtein distance? Do you want

123 main street
123 maine street
123 cain street

all to match?

I have used this app

https://splunkbase.splunk.com/app/5237

to do fuzzy lookups, but even on small lookups of a few hundred rows it is very slow. I expect this is going to run on the search head, due to the KV store, so you're going to be doing serial processing. What size is your lookup - you may well be hitting the default limits defined (25MB):

https://docs.splunk.com/Documentation/Splunk/9.3.2/Admin/Limitsconf#.5Blookup.5D

What are you currently doing to be 'fuzzy' so that your matches currently work, or are you really looking for exact matches somewhere in your data?

Is your KV store currently being updated - and is it replicated? It's probably not replicated, so all the work is on the SH. If you turn on replication (NB: I am not sure how the process works exactly), the store will get replicated to the indexers as a CSV, and if you have multiple indexers you may benefit from parallel processing.

Also, if you are just looking for an exact match somewhere, then the KV store may benefit from using accelerated fields - they can speed up lookups against the KV store (if that's the way you're doing it) significantly. A sketch follows the link below.

https://docs.splunk.com/Documentation/Splunk/9.3.2/Admin/Collectionsconf
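For example, a sketch in collections.conf (the collection and field names are illustrative):

[my_kvstore]
field.addr = string
field.owner = string
# accelerate the field used as the lookup match key
accelerated_fields.addr_accel = {"addr": 1}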

What's your search - maybe that can be optimised as well.

 


genesiusj
Builder

Thanks, @bowesmana .

Q - "When you say fuzzy, do you mean it should match based on similarity using something like  Levenshtein distance? Do you want 
123 main street
123 maine street
123 cain street
all to match."

A - No. I know about Levenshtein; however, the similarity would have to disregard (perhaps not the correct word) the street numbers when counting/calculating the distance.
123 main street and 124 main street would never be a match.
123 main street and 123 main street apt 2 would be a match.
It is assumed, probably incorrectly, that the property owner of 123 main street apt 4 and 123 main street apt 6 is the same for the whole building. Of course, condos knock this idea out.
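Once both addresses are present in the same result row, something like this prefix test expresses that rule (a sketch; the field names are illustrative):

| eval is_match=if(data1_addr==data2_addr OR like(data1_addr, data2_addr." %"), "yes", "no")

Here "123 main street apt 13" matches the pattern "123 main street %", while "124 main street" does not.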

Q - "What size is your lookup - you may well be hitting the default limits defined (25MB)"

A - csv: 1 million records - 448,500 bytes // kvstore: 3 million records - 2,743.66 MB

Q - "What are you currently doing to be 'fuzzy' so your matches currently work or are you really looking for exact matches somewhere in your data?"

A - I stripped any non-numeric characters off the beginning of the address in the lookup and use that field in the as clause of my lookup command against my KV store:

| lookup my_kvstore addr as mod_addr output owner

 

Q - Is your KV store currently being updated - and is it replicated?

A - No replication. The data would be refreshed yearly, or possibly every quarter.

Q - Also, if you are just looking for an exact match somewhere, then the KV store may benefit from using accelerated fields - they can speed up lookups against the KV store (if that's the way you're doing it) significantly.

A - Using the above code, the addr would be the accelerated field, correct?

Thanks again for your help and God bless.
Genesius


bowesmana
SplunkTrust

I have also had a use case where I had 8 million rows of user data and needed to enrich index data with data from that 8m-row lookup. As a short-term stopgap, I ended up indexing the 8m user rows, then searched dataset 1 together with the user data from the index and correlated them using stats by a common key, because the performance of a lookup that size was not good enough.
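The shape of that correlation was roughly this (a sketch; the index, sourcetype, and field names are illustrative):

(index=main sourcetype=events) OR (index=enrichment sourcetype=user_data)
``` collapse both datasets onto the shared key; the enrichment fields fill in beside the event fields ```
| stats values(owner) as owner values(action) as action by user_id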

 

bowesmana
SplunkTrust

Thanks for your detailed response.

 


@genesiusj wrote:

Q - "What size is your lookup - you may well be hitting the default limits defined (25MB)"

A - csv: 1 million records - 448,500 bytes // kvstore: 3 million records - 2,743.66 MB

That seems wrong - 1,000,000 records must be more than 448,500 bytes; there has to be at least a line feed between rows, which alone would give you 1,000,000 bytes.

Anyway, if the CSV is the origin dataset then I don't think the lookup limit is going to be relevant, but are you doing something like this:

 

| inputlookup csv
| eval mod_addr=process_address...
| lookup my_kvstore addr as mod_addr output owner

 

The fact that this is all happening on the search head means that the SH will probably be screaming. What is the size of the SH, and have you checked its performance profile during the search?
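One way to check is against the introspection data (a sketch; the host value is illustrative, and I'm assuming the resource-usage introspection inputs are enabled):

index=_introspection host=my_search_head sourcetype=splunk_resource_usage component=PerProcess data.process_type=search
``` CPU and memory consumed by search processes while the big lookup runs ```
| timechart avg(data.pct_cpu) as avg_cpu max(data.mem_used) as max_mem_mb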


Q - "What are you currently doing to be 'fuzzy' so your matches currently work or are you really looking for exact matches somewhere in your data?"

A - I stripped any non-numeric characters off the beginning of the address in the lookup and use that field in the as clause of my lookup command against my KV store:

 

| lookup my_kvstore addr as mod_addr output owner

 


I have in the past done something similar with wine titles: I "normalised" the wine title by removing all stop words, all words <= 3 characters, and all numbers. I then split it into a multivalue field, converted it to lower case, then sorted and joined. I did this in the base dataset (i.e. your KV store) and also on all wines I see. It is reasonably reliable.
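The normalisation step looks roughly like this (a sketch; the title field name and stop-word list are illustrative):

| eval words=split(lower(title), " ")
``` drop pure numbers, short words, and stop words ```
| eval words=mvfilter(isnull(tonumber(words)) AND len(words)>3 AND NOT match(words, "^(with|from)$"))
``` sort and rejoin into a stable normalised key ```
| eval norm_title=mvjoin(mvsort(words), " ")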

However, that doesn't really solve your issue with the volume...

Q - Also, if you are just looking for an exact match somewhere, then the KV store may benefit from using accelerated fields - they can speed up lookups against the KV store (if that's the way you're doing it) significantly.

A - Using the above code, the addr would be the accelerated field, correct?

Yes, I have seen very good performance improvements with large data sets using accelerated_fields, so do this first. If you have the option to boost the SH specs, that may benefit, but first check what sort of bottleneck you have on the SH.

 


PickleRick
SplunkTrust

Yes, those data volumes seem off. While in the CSV case you could argue that it's a gzipped file size (though it still looks a bit low - with typical gzip compression you expect around a 1:10 ratio), the KV size is way too small.
