I'm coming from a world where I had perfected grep searching,
and now I'm using Solr with great results so far. Still, there are some
searches I'm trying to figure out.

My users might be searching for:

Exustar SM312 Carbon Mountain Bike Shoes

by typing "sm3 carbon" into the search box. They then want to see
all the SM312, SM345, SM389 and other matching shoes.

Right now, my Solr search misses all of these shoes. My Solr
setup is pretty standard; I've only added these filters to my
main text search field:

to index:
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.KStemFilterFactory"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>

to query:
<filter class="solr.SnowballPorterFilterFactory" language="English" />

Any recommendations on how to handle wildcarding in this case?

  • Upayavira at Nov 15, 2012 at 8:54 pm
    You could use a solr.EdgeNGramFilterFactory, which would index
    edge-ngrams, thus:

    Exustar:
    ex
    exu
    exus
    exust
    exusta
    exustar

    SM312:
    sm
    sm3
    sm31
    sm312

    etc.

    This would make your indexing slower and your index bigger, but your
    queries would be faster than wildcard queries, and you wouldn't need to
    insert wildcard * characters into your query string (which is always risky).

    Upayavira
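
    A minimal sketch of a field type along these lines, assuming a
    schema.xml-based setup. The field type name text_prefix, the choice of
    tokenizer, and the gram sizes 2 to 15 are illustrative assumptions, not
    taken from the thread. The n-gram filter sits only in the index-time
    analyzer, so a query term such as "sm3" is looked up as typed and matches
    the prefixes stored for SM312, SM345 and so on.

    ```xml
    <!-- Hypothetical field type; name and gram sizes are illustrative assumptions. -->
    <fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- Emits front-anchored prefixes: sm312 -> sm, sm3, sm31, sm312 -->
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- No n-gram filter here: the typed prefix is matched as-is -->
      </analyzer>
    </fieldType>
    ```

    Documents need to be reindexed after a change like this, since the
    prefixes are generated at index time.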
  • David Alyea at Nov 15, 2012 at 9:09 pm
    Thanks for the suggestion. I tried that, putting it in
    my index section. I deleted the entire collection, stopped Solr,
    restarted it, added back all my documents, and committed.
    Basically, I followed the exact same protocol I always do.
    When I searched for "sm3 carbon" I got nothing, and when I
    searched for just "carbon", nothing either. Looking at the Solr admin
    panel, the document count was right, but any select I ran
    returned 0 results. So something about adding that filter to the index
    section didn't work. Any ideas? I definitely like the way
    this would potentially work.
  • Robert Muir at Nov 15, 2012 at 9:13 pm

    On Thu, Nov 15, 2012 at 9:44 AM, David Alyea wrote:
    to index:
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>

    to query:
    <filter class="solr.SnowballPorterFilterFactory" language="English" />

    I don't think it's a good idea to use four different stemming algorithms:
    three at index time (porter1, kstem, plural) and porter2 at query time.
    This means you are analyzing terms in a totally different way at index
    time than you are at query time.

    Just pick one of them: make your index-time and query-time analysis
    the same as a start, and I think you will see fewer surprises.
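
    One way to follow this advice, sketched under assumptions: declare a
    single analyzer with no type attribute, which Solr then uses for both
    indexing and querying. The field type name text_en_stem and the choice of
    the Snowball English stemmer are illustrative; any one of the four
    stemmers could take its place.

    ```xml
    <!-- Hypothetical field type; the name is an illustrative assumption. -->
    <fieldType name="text_en_stem" class="solr.TextField" positionIncrementGap="100">
      <!-- An <analyzer> without a type attribute applies at index and query time -->
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- A single stemmer, so terms are reduced the same way on both sides -->
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
      </analyzer>
    </fieldType>
    ```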
  • David Alyea at Nov 15, 2012 at 9:49 pm
    OK, I tried that. I had just Snowball and EdgeNGram
    in both index and query. When I ran the "sm3 carbon"
    select, it went from 3,500 matches to 89,000! So yes,
    that edge building works! But too much. And... the
    top-scoring matches didn't look at all like "sm3 carbon"
    products, and the shoes were nowhere in sight. So,
    I'll toy with it on a dev instance and see what I see.
    I definitely like the idea, and I can see that N-gram
    tokens are going to behave like wildcarding.
  • Upayavira at Nov 15, 2012 at 9:59 pm
    Remember to distinguish between recall and precision - you're likely to
    get too many results, but what matters is whether the first ones are
    useful.

    You could have two versions of your field, one with normal stemming and
    another with n-grams, and boost the normal field above the n-gram one,
    so that exact matches score above inexact matches.

    Upayavira
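
    A rough sketch of the two-field idea, assuming the eDisMax query parser
    is used. The field names name and name_prefix, the field types
    text_en_stem (stemmed) and text_prefix (edge n-grams), and the boost of 5
    are all hypothetical; the two field types are assumed to be defined
    elsewhere in the schema.

    ```xml
    <!-- schema.xml: field and type names are hypothetical -->
    <field name="name" type="text_en_stem" indexed="true" stored="true"/>
    <field name="name_prefix" type="text_prefix" indexed="true" stored="false"/>
    <!-- Index the same text a second time through the n-gram analysis chain -->
    <copyField source="name" dest="name_prefix"/>

    <!-- solrconfig.xml: weight stemmed matches above prefix matches -->
    <requestHandler name="/select" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="defType">edismax</str>
        <str name="qf">name^5 name_prefix</str>
      </lst>
    </requestHandler>
    ```

    With a setup like this, a document matching only through a prefix still
    appears in the results (recall), while documents that also match on the
    stemmed field should rank higher (precision at the top). The boost value
    would need tuning against real queries.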
  • David Alyea at Nov 16, 2012 at 4:34 pm
    I ended up with this:

    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="6" side="front"/>

    and it works great! It's important to specify side or the N-gram
    buildout is really huge. My users generally will start typing their
    wildcard searches left-anchored, so having all the generated
    n-grams was not only overkill but was also causing way too many
    false positives.

    To provide some on-the-fly documentation of the above, if
    you have:

    sm333k carbon shoes

    the tokens generated, given my specs above, are:

    sm3 sm33 sm333 sm333k car carb carbo carbon sho shoe shoes

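    For a word with 7+ characters, it makes the four N-grams
    of length 3 to 6, starting from the first character. In Perl, it's like:

    my $x = "mountain";                  # any word of 7+ characters
    for my $i (3..6) {
        print substr($x, 0, $i), "\n";   # mou, moun, mount, mounta
    }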

    Thanks for pointing me in this direction!

Discussion Overview
group: general @
category: lucene
posted: Nov 15, '12 at 5:45p
active: Nov 16, '12 at 4:34p
posts: 7
users: 3
website: lucene.apache.org