Spell checker

Discussion in 'Open Source Projects' started by moster67, Sep 10, 2008.

  1. moster67

    moster67 Expert Licensed User

    Spell checker (closed)

    Hello,

    EDIT - March 15, 2009 - project has now been transformed into a library. See the library section of this forum.

    EDIT/UPDATE: March 5, 2009 - see below and post 11

    UPDATE: see notes below

    I've been working on this project off and on for the last 10 days. This is my first little "serious" project with Basic4PPC, Windows Mobile and PPC - all in order to learn something new.

    I got the idea to this project while following the other thread "Personal Pocket PC Wiki" since I noted a spell checker was in the "todo/wish-list".

    Basically, a spell checker customarily consists of two parts:

    1) A set of routines for scanning text and extracting words, and
    2) An algorithm for comparing the extracted words against a known list of correctly spelled words (i.e., the dictionary). //Source: Wikipedia

    This is more or less what I have implemented in the source-code that I am attaching to this post although I have also added the possibility to ignore a misspelled word and to replace a misspelled word with an own word (which can also be saved to the dictionary).

    However, what mentioned above is only a "half" spell checker since these days spell checkers also suggest replacements/corrections for misspelled words (among other things such as synonyms and grammar-hints). Said suggestions can be proposed by the program based upon various techniques:

    -phonetic algorithms such as "Soundex" among others.
    -wordlists containing common misspelled words and letters commonly inverted
    -algorithms like "edit distance" which measure the amount of difference between two sequences. A famous one is the "Levenshtein distance"
    -and other techniques

    I am currently working on the next version of this spell checker where some of above mentioned techniques have already been implemented. I have also compiled a library for using Soundex and the Levenshtein distance mentioned above. I will post this library shortly here on the forum. When I think the next version of the spell checker is "ready for testing", I will post the source-code here in this thread.

    I am aware of the fact that (at least) WM6 already offers spelling-suggestions and a spell checker if Word (Office) has been installed but still I liked this idea as a project, so I said to myself "What the heck - let's try"! In any case, as far as I know, only the dictionary corresponding to the language of WM6 is being installed so if you want to spell check words in other languages you cannot do so.

    Please bear in mind that I am hobby-programmer and I am sure my code can be improved in many ways (in terms of speed, efficiency and clarity) but still I am quite pleased with the same.

    So far, I have learned two important lessons:

    1) "to keep your eyes open" and expect to find all kind of errors while debugging. The other day, I "lost" 3 hours of precious time because of a stupid error I had made in the code. I was so frustrated that I wanted to send a PM (I never did) to Agraham and tell him that his Collection-library was full of bugs but of course the error was mine. I didn't notice/remember a string-conversion (strToUpper) I had made in my code

    2) to bear in mind that I am developing for a PPC, Smartphone, handheld (or whatever they are called) and that there is a huge difference in speed and available hardware-resources compared to a normal "desktop-PC". One of the critical parts of the first version posted here is the loading of the dictionary. The dictionary attached has nearly 70000 words and in my first "internal" version it loaded very fast when running on my PC while on my PPC (a Samsung i780 which has a powerful processor with lots of free memory) it took AGES!! This was due to the fact that I read the dictionary line by line instead of using "FileReadToEnd". Another critical part is when I compare the words to spell-check against the words in the dictionary. First, I made a "For-Next iteration" and it took ages. Then I saw Agraham's Collection-library with the option to add IndexOf-search to arrays and everything became so much faster!!! This time, I wanted to send him a PM to praise him for the libraries he is furnishing us with (this is of course valid for all others writing libraries for the Basic4PPC-comunity).

    The version posted here works quite well and is being loaded and executed quickly both on my PC and on my PPC, especially the compiled version. The dictionary included is an English one but you can find other free dictionaries for other languages on the Internet (have a look for instance at: Word lists - download wordlists for free - language dictionary translation cracking passwords - despite the description of the web-page it actually has really good dictionaries online - there are other sites as well - try with Google). Please rename the dictionary to "Eng_Dict.txt" and keep it in the application-directory or you may of course adjust the name in the source-code. You also need Agraham's excellent Collection-library.

    Please feel free to use or modify the code posted here for use in your own applications. If you modify the code to the better or add useful options, please then post the modified/added part of the code here so all of us can benefit from your improvements (this is also why I wrote this post in the Open-Source part of the forum).

    NEW VERSION WITH SUGGESTIONS:

    I have attached to this post a new version of the spell-checker - this time with suggestions. For the time being, I have not included the source and only the executable for windows (desktop). Included in the zip-file, you will also find an English wordlist (other languages will be added later) and some support files. You also need Agraham's StringsEx-library (included). All these files must be in the same directory together with the executable in order to run the program. After the first run of the program, an additional file will be created in the same directory.

    This version is using Soundex for creation of suggestions but not only - it also takes into consideration common typing-errors (based on functions called "Near Miss Strategy" and introduced by one of the first spell-checkers on the market, namely Ispell for UNIX and with its roots dating back to 1971). I have included a text-file called Test.txt which you can load for testing purposes. In overall, I think the spell-checker performs quite well, both in terms of suggesting valid replacements and in terms of speed.

    However what regards performance on the PPC - that's another story :( I am currently trying out different ways of coding in order to boost performance on my PPC but I still have a lot of testing to do. I will in the next days post the source-code, structured differently, and maybe someone can give me a hand to speed up the performance on the PPC.

    Please remember that the user-interface is only present to facilitate the testing of the code and therefore I am not putting any effort into cosmetic issues. Once the code has been finished, the idea is to include it in another project that may require spell-checking. I might one day make a library out of it.

    Please let me know what you think.

    Update - March 5, 2009:

    This project has now been transformed into a library.


    Inputs, suggestions for improvements, bug-reports, test/speed-results and small talk are welcomed!!

    rgds,
    moster67
     
    Last edited: Apr 12, 2009
  2. agraham

    agraham Expert Licensed User

    Good start! A few (or more!) suggestions

    1)At the moment if you edit the text box and press Check a spell check is not done because FirstRun is still False. I would make a KeyPress event for the textbox and set FirstRun to true in there.

    2) Using Goto is generally considered "a bad thing". They are bad because they destroy the structure of the code. Nested Ifs, with indentation are better as you can see the progam flow better.

    3) Also "a bad thing" is jumping back to the start of a loop without passing Next. Depending upon the compiler you might get away with it or you might get misbehaviour. If you really really need to shortcut with a Goto then jump to a label just before the Next. EDIT :- And don't jump out of an inner For, While or Until loop, use Exit to leave a loop early.

    4) ArraysEx.IndexOf does a linear search so finding "Zebra" in a list of 10,000 items needs almost 10,000 comparisons. If you want to speed things up then try ArrayEx.Sort and then ArraysEx.BinarySearch (use the same comparison for both). This would only need about 14 comparisons to find "Zebra", a dramatic difference in speed!

    5) You don't need to Dim local variables, just use them and they are declared automatically for you. Have Tools-> Check for ... variables checked to pick up any typos in local variable names.
     
    Last edited: Sep 11, 2008
  3. moster67

    moster67 Expert Licensed User

    Thank you for your suggestions and the time spent looking into the code.

    You are right. Initially I "disabled" user-input in the textbox until Spell-checking had been completed and the error mentioned by you would not trigger. Somehow, I took away this code from the source-code posted and of course this behaviour now appears. However, your solution seems like a good alternative/solution.

    I know....With next release I will try to get rid of the "GoTo":s

    I had no idea about this...thank you.

    Wow...and I was so pleased that I had managed to speed up my code already by using ArraysEx.IndexOf. I will try to adapt my code in accordance with your suggestions. Speed will become very important when I apply the Soundex- and Levenshtein-algorithms.

    I guess that the "dimming" is due to my Visual Basic-habits (and the Option Explicit). Good point.

    I am now in the office (don't tell my boss that I am writing this during working-hours) but I will see if I can adopt your feedback/suggestion this evening.

    Rgds,
    moster67
     
    Last edited: Sep 11, 2008
  4. agraham

    agraham Expert Licensed User

    Usually "Next" is where the loop variable is incremented and tested and the branch made back to the beginning of the loop if the termination condition is not satisfied. Bypassing "Next" means that you re-run the loop code a second time with the previous loop variable value. This may or may not cause problems depending upon what that code is doing.
     
  5. moster67

    moster67 Expert Licensed User

    new version

    Finally I have started programming again after weeks of other stuff which required my attention.

    A new version, this time with suggestions, has been added for download to the first post along with some details about it.

    rgds,
    /moster67
     
  6. moster67

    moster67 Expert Licensed User

    Benchmark test

    Can you all please do me a favour and run the attached application on your PPC?

    I believe I have sorted out the performance-problem on the PPC. The spell-checker with suggestions now runs quite fast on my PPC (Samsung i780) with satisfactory speed/results, at least compared to the previous version. However, I would like to make a benchmark-test and see how it performs on other devices.

    Please help me out by:

    1) unzipping the attached file into a new directory on your PPC. Please ensure that all files are in the same directory.
    2) getting a piece of paper and a pen
    3) running the application named "PerformanceTest.txt". The first time it is executed, the loading of the application will be real slow. This is because it is generating a HashKeyMap which is needed for successive runs. Once the program has been loaded, a message-box will pop up indicating starting- and ending-time related to the loading of the application. Please write down said values on your piece of paper or why not just the result expressed in seconds (I am leaving the maths to you). You will now see some sample sentences in the textbox. Please click on the test-button and we will now check the performance on your device for getting valid suggestions. Hopefully within a not too far future you will see another message-box popping up. Again, please take note of the starting- and ending-time and write it down on your piece of paper (or just the result expressed in seconds). After this, another messagebox should appear indicating how many suggestions were produced - don't bother to write down this number since it has no sense and is not needed. Finally, exit the application.
    4) running the application again. This time the loading should be faster because necessary data is being read from HashKeyMap generated during the first run. Please proceed as you did before and take note of the information given.
    5) posting your results in this thread as follows:

    -Name of PPC/device (for instance Samsung i780)
    -1st run -loading time expressed in seconds
    -suggestions time expressed in seconds
    -2nd run -loading time expressed in seconds
    -suggestions time expressed in seconds
    -Notes: - any comments you wish to add

    If you want to re-run the test, please ensure to delete the HashKeyMap-file before doing so.

    My test results were as follows:

    -Samsung i780
    -1st run - loading time - 75 seconds
    - suggestions time - 4 seconds
    -2nd run - loading time - 5 seconds
    - suggestions time - 12 seconds
    -Notes: - I ran the application from the Storage-card while charging the battery

    It should be noted that this test-application doesn't stop to show you the retrieved suggestions as it would normally. Another thing to take into consideration is that the measuring of time also includes parsing which is not related to getting suggestions.


    I thank you in advance for your kind help in running this benchmark-test and I hope you will be in many doing so.

    Regards,
    moster67
     
    Last edited: Mar 15, 2009
  7. Zenerdiode

    Zenerdiode Active Member Licensed User

    monster67,

    Here's some results for you - I'm sorry I didn't see your post until today.

    HP iPaq hx2190b - PXA270 312MHz WM5
    98
    5
    13
    18
    MMC Storage card with AC power

    HP iPaq hx2190b - PXA270 312MHz WM5
    91
    6
    8
    18
    Internal Memory with AC power

    HP iPaq hx4700 - PXA270 624MHz WM 2003SE
    75
    5
    6
    12
    MMC Storage card with AC power

    HP iPaq hx4700 - PXA270 624MHz WM 2003SE
    70
    4
    5
    12
    Internal Memory with AC power

    TDS Recon - PXA255 200MHz WM5
    138(!)
    9
    13
    27
    Internal Memory with AC power

    A range of processors and operating systems there for you. I suggest that you show your main form first and tell the user that there will be a short delay. Are you using the WaitCursor or is it just the system implementing that? On all my devices, the wait cursor stops spinning so it will appear to the end user that the system hangs.
     
  8. moster67

    moster67 Expert Licensed User

    Zenerdiode - Thank you very much for your test-results! It was great that you had also the possibility to run the test on various devices and that you also ran it in different locations (internal memory and Storage). Very appreciated!!

    Based on your results (which confirm mine), I have the following comments:

    In the 1st run, a hashtable (and also saved as a file) is created and that's why it takes so long to load the application but once loaded in memory it is faster than the 2nd run in providing valid suggestions (the 1st run is using hashkey for searching while the 2nd run uses binary-search).

    In the 2nd run, I load the hashtable created in the 1st run and the loading is much faster (in this case I am not loading it as a hastable) but the time needed for providing valid suggestions is longer than in the 1st run (this time I am using binary-search).

    All functions mentioned above may be found in Agraham's Collection-library.

    While running above procedures on a desktop-PC with all its power, you don't note any differences because everything runs at the speed of the lightning but on the PPC you note the difference in speed very much.

    The loading of the Hashtable (I am talking about nearly 70,000 words) on the PPC takes too long and therefore, even if the the time needed for providing valid suggestions takes less than using binary-search, I will use the procedure executed in the 2nd run. In my test-application, the time indicated for retrieving suggestions was related to all words. Normally, the spell-checker will stop after finding a wrong word so the time needed, even using the 2nd method and binary-search, is more or less immediate on the PPC. The only drawback is that the application which furnish spell-checking must include not only the dictionay (word-list) but also a previous generated HashKeyMap.

    I think that the ideal solution would be to be able to load the HaskKeyMap somehow during the boot of the device and then use HashKey-search but I don't know how to do that yet (if it's possible). I guess that Microsoft Pocket-WORD and its spell-checker uses a similar method although I am not sure.

    Anyway, I will shortly post a new version of the Spell-Checker with source-code and probably with support for other languages.

    BTW: I don't recall if I used the WaitCursor or not (I am in the office and I don't have the code available) but I got your point. As mentioned in a previous post, the interface of the spell-checker is less-important for me since the code is meant to be integrated into another application that may need spell-checking and that application should take into cosideration similar issues. Therefore, you are right that my benchmark-application should have foreseen this.

    Regards,
    moster67
     
    Last edited: Oct 23, 2008
  9. klaus

    klaus Expert Licensed User

    Hi moster97

    The results on my Qtek 9090 WM2003 are the following
    Alone without AC, main memory:
    91
    5
    7
    16

    On the AC adapter with ActiveSynch ON
    82 - 97 not always the same
    7
    7
    16

    Best regards.
     
  10. moster67

    moster67 Expert Licensed User

    Thank you Klaus.

    Rgds,
    Moster67
     
  11. moster67

    moster67 Expert Licensed User

    An update:

    finally I have been able to look into this project of mine again.

    Well, I have reduced the loading time of the dictionary-files dramatically and now both are loaded within 3 seconds on my device.

    Anyway, as already mentioned in a previous post of mine, the big news is that I have transformed this project into a class-library written in VB.NET. It works so far without any problems but before releasing it, I just need to add two or three missing features. I have tested the library in a sample-project in VB.NET and it works nicely although of course there may be a few bugs not yet discovered. I will most likely make a posting on XDA where developers can download it and test it.

    The bad news: While the loading of the data works allright in a Basic4PPC-project, the spellchecking-engine does not:( This is due to the textbox-control of Basic4PPC which must be passed on to the library. However, I will later post a question in the Library-section of this forum and see if anyone can help to get the library compatible with Basic4PPC.

    Regards,
    moster67
     
    Last edited: Mar 5, 2009
  12. Erel

    Erel Administrator Staff Member Licensed User

    It shouldn't be too hard to pass a textbox control from Basic4ppc to an external library.
    A VB .Net Sub:
    Code:
    Public Sub DoSomethingWithTextBox(textBox1 As TextBox)
     textBox1.Text = 
    "test"
    End Sub
    In Basic4ppc:
    Code:
    YourLibrary.DoSomthingWithTextBox("TextBox1")
     
  13. moster67

    moster67 Expert Licensed User

    I already resolved it. I did as you suggested but the trick was to pass on the TextBox with double apostrophes around the name of the textbox:

    Code:
    YourLibrary.DoSomthingWithTextBox("TextBox1")
    From my test-application in VB.NET it works without the double apostrophes:

    Code:
    YourLibrary.DoSomthingWithTextBox(TextBox1)
    Thank you Erel
     
  14. tsteward

    tsteward Active Member Licensed User

    Looking forward to the version, keep up the good work!
     
  15. moster67

    moster67 Expert Licensed User

    Just to inform you that this project has now been transformed into a library which may found in the Additional Library section of this forum.

    Regards,
    moster67
     
Loading...
  1. This site uses cookies to help personalise content, tailor your experience and to keep you logged in if you register.
    By continuing to use this site, you are consenting to our use of cookies.
    Dismiss Notice