Quiz #2: Words counting

Erel · Nov 25, 2012

Question #2a:
The input is a text file with a list of words. One word per line.
You need to print the number of times that each word appears in the file.

Example:
Input file:

this
is
a
simple
question
with
a
simple
answer

Output (note the the order is not important):

this: 1
is: 1
a: 2
simple: 2
...

Question #2b:
Same question as #2a. However the size of the file is 300mb

mc73 · Nov 26, 2012

Erel said:

B4X:

Sub HashCode(s As String) As Int
    Dim h As Int = 0
    For i = 0 To s.Length - 1
        h = 31 * h + Asc(s.CharAt(i))
    Next
    Return Abs(h)
End Sub

Isn't this hash generator, logically equivalent to the 3d array I used, using the asc function too? In the beginning I thought of the n*26^i thing. What stopped me was the case of 20-letters words. Wasn't sure if the os will handle such thing properly. So, I move to limit to 3 letters division.
When you get to count all letters, you lose a bit of time in contrast with the 3d array.
I will try checking both solutions. If somehow my android crashes, it's just fine, I really like the new Nexus

Erel · Nov 26, 2012

This hash function is the same as Java hash function. It is somewhat similar to your array.

Don't miss question 2c: http://www.b4x.com/forum/basic4andr...23685-quiz-2-words-counting-2.html#post137436

mc73 · Nov 26, 2012

Erel said:
Question 2c: If we run this code the program will crash due to a special feature of Android. Why will it crash and what can we do to avoid this crash?

This is not fair, I haven't got the slightest clue about android's special features :sign0148:

thedesolatesoul · Nov 26, 2012

Well your code could crash even without a special feature of android (i havent thought about that yet)
If the data in the file is such that all of the words fall in the same group, but all of them are unique words, your map 'words' will run out of memory.

Also, your code is in Activity_Create, it could crash due to ANR, blocking the UI thread for too long.

COBRASoft · Nov 26, 2012

Erel said:
This solution will (most probably) not work as you cannot hold all the words in memory.

Hi Erel, you don't have to keep ALL words in memory, but only the unique words. My solution should work for all text files you throw at it with normal words.

If you say you want to work with 'unnatural' words, then I suggested already to keep everything in 2 or more files (starting with 'A', 'B', ...). It would work without a problem and still be quite fast too.

Also, your solution won't work if I throw a text at it with words of let's say 500 characters long (e.g. CSV record libes). You will run out of memory and no, you can't know how big the words will be in front. So, textfiles is the only solutions. I would place them in a folder strucure like below. I've made the example tree with 2 levels and 2 characters, but it can easily be changed to work on 3 levels or more, and never run out of RAM.

B4X:

-A
  -A
    -words/files
  -B
    -words/Files
-B
  -A
  -B
  -C
-C
  -...

COBRASoft · Nov 26, 2012

thedesolatesoul said:
Well your code could crash even without a special feature of android (i havent thought about that yet)
If the data in the file is such that all of the words fall in the same group, but all of them are unique words, your map 'words' will run out of memory.

Also, your code is in Activity_Create, it could crash due to ANR, blocking the UI thread for too long.

Not because it is in the 'Create' part, but because it will be blocking the UI for sure for too long. After some time, Android will kill the app because of this.
Avoiding this can be done with background processing like a service or so.

Erel · Nov 27, 2012

Well your code could crash even without a special feature of android (i havent thought about that yet)
If the data in the file is such that all of the words fall in the same group, but all of them are unique words, your map 'words' will run out of memory.

This will not happen with "real" text files not specifically built to break this algorithm. This is exactly the purpose of using a hash function.

Also, your solution won't work if I throw a text at it with words of let's say 500 characters long (e.g. CSV record libes). You will run out of memory and no, you can't know how big the words will be in front.

The length of the words doesn't really matter (assuming that the words are not made of millions of characters) . We know that the text overall size is 300mb. So we split it to smaller chunks. Each chunk will be about 10mb.

About the third question, I was referring to the ANR dialog which Android shows when a UI event is not handled in 5 seconds.

The solution is quite simple. You can add a call to DoEvents in the loop.

thedesolatesoul · Nov 27, 2012

Yes, its true. Well, you asked me how to make it crash

Roger Garstang · Nov 27, 2012

Are All the words real words? Perhaps you could download one of the ENABLE dictionaries or something like Words with Friends and other apps use and have it sorted and indexed. You could even store the indexes of the start of each letter to search faster. In your code all you then do is just store the index number into the dictionary and count, so two numbers are stored instead of text that is really slow to navigate and large to store. You could even merge this with the SQL solution for even faster results.

Roger Garstang · Nov 27, 2012

The dictionary mentioned above is only about 2MB too, so I'm pretty sure a 300MB file is going to have a lot of duplicates. I imagine it could be compressed pretty good too for storage in APK, etc.

edgar_ortiz · Nov 28, 2012

Thats the point:

HTML:

So we split it to smaller chunks. Each chunk will be about 10mb

too many operations using the algorithm of "hash".

Instead you can group words by their length ... make a file for words with the letter "x" having 3,5,7,10+ characters

Regards,

Edgar

Informatix · Nov 28, 2012

Erel said:

B4X:

Sub HashCode(s As String) As Int
   Dim h As Int = 0
   For i = 0 To s.Length - 1
      h = 31 * h + Asc(s.CharAt(i))
   Next
   Return Abs(h)
End Sub

The big problem with this classic algorithm is the number of collisions. In fact, it is not really suited for real languages where the number of unique values can easily go over 100000 due to inflected forms and different spellings (with or without a hyphen, color/colour...) and the character set use non-ASCII characters. These collisions can happen with words starting with the same letters and having the same length, so the only way that I know to reduce the number of collision cases is to increase the prime number in your formula (the following should be enough for most languages):

B4X:

Sub HashCode(s As String) As Long
   Dim h As Long = 0
   For i = 0 To s.Length - 1
      h = 151 * h + (Asc(s.CharAt(i)) - 32)
   Next
   Return h
End Sub

Problem with this modified algorithm: your hashcode is a long integer. Coded in hexadecimal, the average size needed to store it is very close to the word average length. So, is it really a good idea to use a hashcode ?

Quiz #2: Words counting

Erel

B4X founder

mc73

Well-Known Member

Erel

B4X founder

mc73

Well-Known Member

thedesolatesoul

Expert

COBRASoft

Active Member

COBRASoft

Active Member

Erel

B4X founder

thedesolatesoul

Expert

Roger Garstang

Well-Known Member

Roger Garstang

Well-Known Member

edgar_ortiz

Active Member

Informatix

Expert

Similar Threads