PhoneticAlgorithms Library (ex-StringComparison Library)

Discussion in 'Additional Libraries' started by moster67, Sep 13, 2008.

  1. moster67

    moster67 Expert Licensed User

    UPDATE: The library has a new name that is less misleading than the previous one - as a matter of fact, it's a good name :)

    The library now includes three algorithms that are very useful for spell-checking, word games, family research, etc. I needed them for my spell-checking program:

    - Soundex (a phonetic algorithm)
    - Double Metaphone (another phonetic algorithm)
    - Levenshtein distance (edit distance)

    For further information:
    - about Soundex, please see: Soundex - Wikipedia, the free encyclopedia
    - about Double Metaphone, please see: Double Metaphone - Wikipedia, the free encyclopedia
    - about Levenshtein distance, please see: Levenshtein distance - Wikipedia, the free encyclopedia
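
    As a rough illustration of what an edit distance measures, here is a minimal Python sketch of the Levenshtein distance: the smallest number of single-character insertions, deletions, and substitutions needed to turn one string into another. It is not the library's C# source, and the function name `levenshtein` is made up for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance between a and b (two-row dynamic programming)."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

    A spell-checker could, for instance, rank candidate words by this value and offer the ones with the smallest distance first.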

    Neither the functions nor their underlying algorithms were written by me. In my initial library, the sources (for Soundex and LD) were in VB.net, while this new version has sources in C#, which Agraham kindly converted (and corrected) from VB.net. He also added the Double Metaphone algorithm mentioned above. Because the source is now in C#, the library can also be merged into the application.

    The usage should be quite straightforward - just have a look at the web pages above and the syntax of the functions. As for Double Metaphone, there are two versions of it: the original, which returns a string, and a numeric version that returns an integer calculated from the string. Please see Agraham's explanation in this very thread.

    If required, and as soon as I have some spare time, I can write a small help file with a demo project. Otherwise you will surely see this library and its usage in the next version of my Spell-Checker. In short, this library (and its functions) is the "engine" for creating suggestions for misspelled words.

    This library also served as a test for me in learning to build libraries. The library is attached along with the source.

    Once again, I want to thank Agraham for his support and help in making this library.

    Rgds,
    moster67
     

    Attached Files:

    Last edited: Sep 17, 2008
  2. agraham

    agraham Expert Licensed User

    Very interesting functions, although I agree the name of the library might be better chosen :). For my own interest I have done a literal translation of the VB source into C# to play with and very originally named it SoundexLD! I don't know whether you are using VS or SharpDevelop, but would you like a copy so that you can try building a library in C# and use the source to merge the library into your compiled app?

    EDIT:- You have obviously researched this. Do you know why the Soundex value is only 4 characters long? Intuitively that seems a bit too few for general use. It means "ting" and "ted" have different Soundex values but "inventing" and "invented" have the same.
     
    Last edited: Sep 15, 2008
  3. moster67

    moster67 Expert Licensed User

    I agree, I'll see if I can come up with something more precise....

    I'm all ears! Yes, I'd love a copy of your code. If you could also explain how to merge the library into a compiled application, that would be fabulous (or a link to where it is explained). As a matter of fact, I will probably need another function (another algorithm) for my spell-checker, but since I've seen this algorithm only in C#, I would need to make another library, as my current library is based on VB.net code. However, if you have already converted my library to C#, maybe I can send you the source code for the new function (which is also in C#) by PM and we can make just one library. I am using Visual Studio.

    It's true that I have researched this, but to be frank with you, I really don't know why the Soundex value is only 4 characters. I haven't studied the algorithm in detail and I doubt I would be able to understand it fully. However, you are right - the Soundex algorithm is quite old and not always perfect. Initially it was used for comparing surnames by US immigration. There are variations of the algorithm. Another drawback of Soundex is that it is closely tied to English and less suitable for use with other languages. Really good information about Soundex can be found here: Soundex - the True Story

    I am now looking into the Double Metaphone algorithm, which is more recent and better suited to other languages. Here you can read about it: Double Metaphone - Wikipedia, the free encyclopedia - this is the new function mentioned above that I'd like to implement as a library.

    I really appreciate the help/advice you have given and the interest you are showing in my little project.

    Rgds,
    moster67
     
  4. agraham

    agraham Expert Licensed User

    You need the C# code for the library with the same name as the library, in this case SoundexLD.dll and SoundexLD.cs. Put the cs file in the "Basic4ppc Desktop/Libraries" folder and that source will be compiled into the final app, avoiding the need for the library at run time. This only works on optimised compiled apps. There is no need to put the dll in the Libraries folder; it just needs to be in the app's folder. If you look in the Libraries folder you will see that the source for most of the libraries supplied with B4ppc is there, so they get merged into the app.
    Visual Studio 2005 project attached. This is a device project but I assume your version of VS can cope with devices (not all can). I'd be happy to make a library in C# for you but why not have a go yourself first? If you can do it in VB then it is much the same for C#.

    EDIT:- I followed the link to "Soundex - the True Story" and note that the values returned by the library don't match the examples or the ones from the Soundex Conversion Program. I'll have a look at why.

    EDIT:- If the "True Story" article has the correct algorithm then I'm afraid that there are several things wrong with the algorithm you used. It never sets PrevCode so it doesn't ignore repeats. It sets the first letter at the start of the algorithm rather than at step 4 and it ignores 0s at step 2 rather than stripping them at step 5. See the WOOLCOCK example of how the W replaces a 0 which in the case of the VB code has been thrown away earlier. I'll do a rewrite.
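
    The step order described above can be sketched in Python as follows. This is only an illustration, not the rewritten C# source: the step numbers 2, 4 and 5 are the ones named in the post, while the treatment of the remaining steps (cleaning the input, collapsing repeats, padding to four characters) and the names `CODES` and `soundex` are assumptions. Note that this simplified variant also treats H and W like vowels, which differs from some other published Soundex rules.

```python
# Letter-to-digit table; any letter not listed (vowels, H, W, Y) is coded as "0".
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str, length: int = 4) -> str:
    """Simplified Soundex sketch following the step order discussed in the thread."""
    # Assumed step 1: upper-case the name and keep only the letters A-Z.
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    # Step 2: code every letter, including the first, and keep the 0s for now.
    digits = [CODES.get(c, "0") for c in letters]
    # Assumed step 3: collapse runs of the same digit (the "PrevCode" check).
    collapsed = [digits[0]]
    for d in digits[1:]:
        if d != collapsed[-1]:
            collapsed.append(d)
    # Step 4: only now replace the first digit with the first letter.
    collapsed[0] = letters[0]
    # Step 5: strip the 0s that were kept as separators until this point.
    code = collapsed[0] + "".join(d for d in collapsed[1:] if d != "0")
    # Assumed final step: pad with 0s or truncate to the requested length.
    return code[:length].ljust(length, "0")


print(soundex("WOOLCOCK"))  # W422
```

    For WOOLCOCK the letters code to 0 0 0 4 2 0 2 2, which collapse to 0 4 2 0 2; the W then replaces the leading 0 and the remaining 0 is stripped, giving W422. With the same sketch, "inventing" and "invented" both give I515, while "ting" and "ted" give T520 and T300.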

    EDIT:- Obsolete source code removed from this post. The latest version is with the library in the first post of this thread.
     
    Last edited: Sep 18, 2008
  5. moster67

    moster67 Expert Licensed User

    You are absolutely right. I will have a go at it this evening at home.

    I don't know what to say apart from that you're really amazing. I admit that I have not verified the function/algorithm used in my library in full but the results it returned seemed OK, in particular with longer words. As mentioned in my previous post, there are variations floating around and perhaps I got hold of one modified for a specific purpose.

    At least, now I know what I shall be doing this evening once back home from work.

    Rgds,
    moster67
     
    Last edited: Sep 16, 2008
  6. agraham

    agraham Expert Licensed User

    I've rewritten the Soundex algorithm so it now agrees with the samples I've found and got the source for DoubleMetaphone working as a library. I'll post the source for you to include in your library once I've tidied it up and tested a bit more.
     
  7. agraham

    agraham Expert Licensed User

    Here is the C# source with the improved Soundex and DoubleMetaphone algorithms. There are two versions of DM, the original returning a string and a numeric version that returns an integer calculated from the string. The idea of this was to make long series of comparisons faster as numeric comparisons are faster than string comparisons. However this is not necessarily true for B4ppc as it holds integers as strings anyway and converts them for comparisons so there will be little, if any, speed benefit.
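
    To illustrate the general idea of a numeric key, here is a small Python sketch that packs a short phonetic code string into a single integer so that keys can be compared as numbers. It is purely hypothetical: the thread does not say how the library calculates its integer, and the alphabet `ALPHABET`, the 4-bit packing, and the function name `key_to_int` are all assumptions made for this example.

```python
# Hypothetical symbol set for a short phonetic key. Index 0 (a space) is
# reserved for padding, so every symbol fits into 4 bits (values 0..14).
ALPHABET = " APSKXJTFNHMLR0"


def key_to_int(key: str, width: int = 4) -> int:
    """Pack the first `width` symbols of `key` into one integer, 4 bits each."""
    padded = key[:width].ljust(width)  # pad short keys with spaces
    value = 0
    for symbol in padded:
        # ALPHABET.index raises ValueError for a symbol outside the assumed set.
        value = (value << 4) | ALPHABET.index(symbol)
    return value


# Illustrative keys only; they are not claimed to be real Double Metaphone output.
print(key_to_int("TM") == key_to_int("TM  "))  # True: padding does not change the key
print(key_to_int("TM") == key_to_int("TN"))    # False: different symbols, different key
```

    Two strings that agree in their first four symbols map to the same integer, so an equality test on the integers stands in for a string comparison.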

    EDIT:- Obsolete source code removed from this post. The latest version is with the library in the first post of this thread.
     
    Last edited: Sep 18, 2008
  8. moster67

    moster67 Expert Licensed User

    Fantastic work !!

    This will keep me busy tonight and if I manage to compile the library, I will post it later here.

    Thank you Agraham.

    rgds,
    moster67
     
  9. moster67

    moster67 Expert Licensed User

    Updated version of the library

    There is now an updated version of the library. Please refer to the first post in this thread.

    My sincerest thanks go to Agraham for his help and support.

    Rgds,
    moster67
     
  10. Georg

    Georg Member Licensed User

    Helpfile?

    Hi,

    a helpfile would be helpful, I think :)
     
  11. moster67

    moster67 Expert Licensed User

    Hi there,

    At the moment I am rather busy with other stuff (not related to programming). However, I hope to post an update to the Spell-checker project, where this library is being used, soon. At the same time I will try to make a helpfile and a demo project for this library.

    Rgds,
    moster67

     