Quiz #2: Words counting

Erel

B4X founder
Staff member
Licensed User
Longtime User
Question #2a:
The input is a text file with a list of words. One word per line.
You need to print the number of times that each word appears in the file.

Example:
Input file:
this
is
a
simple
question
with
a
simple
answer
Output (note the the order is not important):
this: 1
is: 1
a: 2
simple: 2
...

Question #2b:
Same question as #2a. However the size of the file is 300mb ;)
 

margret

Well-Known Member
Licensed User
Longtime User
Does the 300mb file have duplicates, or could it? Is it just one word per line? About how many entries are in the 300mb file?

I would sort the file Asending, then set unique on. Now copy to a second file. I would then skip through the unique file in a for next loop, inside that loop, seek the value in the full file and if found use a do while field = value for the count.
 
Last edited:
Upvote 0

imbault

Well-Known Member
Licensed User
Longtime User
my solution using sqlite
cFile shoud be the File
the log will have the answer

B4X:
'Activity module
Sub Process_Globals
   Dim dbSQL As SQL
   Dim cFile As String = "test.txt"
End Sub

Sub Activity_Create(FirstTime As Boolean)
   Dim query As String
   Dim Cursor As Cursor
   Dim i As Int 

   If File.Exists(File.DirAssets, cFile) Then

      dbSQL.Initialize(File.DirDefaultExternal, "words", True)
      query = "CREATE TABLE IF NOT EXISTS words ( word TEXT) ;" 
      dbSQL.ExecNonQuery(query)
      query = "DELETE from words ;"
      dbSQL.ExecNonQuery(query)

      
      ReadTextFile
      query="Select word,count(1) as number from words where word is not null group by word ;"
      Cursor = dbSQL.ExecQuery(query)
      For i = 0 To Cursor.RowCount - 1
         Cursor.Position = i   
         Log(Cursor.GetString("word") & Cursor.GetInt("number")& CRLF)
      Next 
         
   End If   
End Sub


Sub ReadTextFile 
Dim Text As TextReader    
Dim query As String
Text.Initialize(File.OpenInput(File.DirAssets, cFile))   
Dim line As String    
line = Text.ReadLine        
Do While line <> Null                
   line = Text.ReadLine    
   query ="INSERT INTO words (word) VALUES (?)" 
   dbSQL.ExecNonQuery2(query,Array As String(line))

Loop    

Text.Close
End Sub
 
Upvote 0

Beja

Expert
Licensed User
Longtime User
The poor man's answer

dim wordtemp as string
dim words as string
dim wordcount as int

wordtemp = ""
words = ""
wordcount = 0

open "c:\intext.txt" for Input as #1
open "c:\outtext.txt" for append as #2

do while not #1 EOF

Line Input #1, wordtemp
if instr(wordtemp, words) then
wordcount = wordcount + 1
words = words & " " & str(wordcount) & " Times"
loop
out "c:\outtext.txt" words
Close #1
close #2
end if
 
Last edited:
Upvote 0

edgar_ortiz

Active Member
Licensed User
Longtime User
Depending from the sorting algorithm...
1- create 27 files, one for each letter
2- write in each file, the words that start with the corresponding letter
3- sort each file
4- with each sorted file, count records until "change" and write to a "result file"
 
Upvote 0

Erel

B4X founder
Staff member
Licensed User
Longtime User
Answer for question a:
B4X:
Dim Words As List = File.ReadList(...)
Dim m As Map
m.Initialize
For Each w As String In Words
 Dim count As Int
 If m.ContainsKey(w) Then count = m.Get(w) Else count = 0
 m.Put(w, count + 1)
Next
'Print the map
For i = 0 To m.Length - 1
 Log(m.GetKeyAt(i) & ":" & m.GetValueAt(i))
Next

About question b.
The problem is that you cannot store all the data in memory.
One approach is to use SQL as suggested by imbault. In this case you actually delegate the problem to SQLite engine which can handle much larger files.

Edgar is close to the solution that I'm thinking of.
The problem with your suggestion is that the words are not distributed evenly. What will happen if 20% of the words start with 'a'. In that case the corresponding file size will be about 60mb. This is still to much to hold in memory (which is required in order to count the words).

I will write my solution tomorrow. Note that it doesn't create any additional files.
 
Upvote 0

COBRASoft

Active Member
Licensed User
Longtime User
For this question, definitely don't sort, takes too long, even for a db engine. Reading everything at once in memory is too big as erel said.

I can think of several solutions for this second 'problem', with or without creating files. The easiest one is not to read the whole file at once like Erel does in his first solution, but to read line by line. Next to that, define an object ('Type' or 'Class') which has the word and the amount of times it occures. For each unique word you read, create a new instance of this 'object', set the amount to 1 and add it to a list. If the word already exists, add 1 to the occurence amount. Move to the next line in the file until end of file.
 
Last edited:
Upvote 0

mc73

Well-Known Member
Licensed User
Longtime User
Ok, here's a try which searches for every single string in a limited area:
B4X:
Sub Globals
    'These global variables will be redeclared each time the activity is created.
    'These variables can only be accessed from this module.
    Dim finalList As ListView
End Sub

Sub Activity_Create(FirstTime As Boolean)
finalList.Initialize ("")
Activity.AddView (finalList,0,0,100%x,100%y)
' Here we can start serial-read the file instead of the following testing list
Dim a As List
a.Initialize 
a.AddAll (Array As String _
("AWORD","BWORD","AWORD","CWORD","MINAS","DWORD","MINA","NIKOS","BWORD","AWORD","MINAS","MINAS","AWORD","BWORD","OMEGA"))

Dim ar(26,27,27) As String 
' instead of this loop we can loop until eof
For k=0 To a.Size -1
    ' s would be the read string
    Dim s As String 
    s=a.Get(k)
    If s.Length =1 Then 
        s=s & Chr(64) & Chr(64)
    Else If s.Length =2 Then
        s=s & Chr(64)
    End If
    Dim tempA As String 
    tempA=ar(Asc(s.CharAt (0))-65,Asc(s.CharAt(1))-64,Asc(s.CharAt (2))-64)
    Dim tempI As Int 
    tempI=tempA.IndexOf (s & ",")
    If tempI=-1 Then
        tempA=tempA & s & ",1" & TAB
    Else
        Dim tempC As Int 
        Dim tempI2 As Int 
        Dim tempI3 As Int 
        Dim tempD As String 
        tempI2=tempA.IndexOf2 (",",tempI+1)
        tempI3=tempA.IndexOf2 (TAB,tempI2+1)
        tempC=tempA.SubString2(tempI2+1,tempI3)
        tempC=tempC+1
        tempD=tempA.SubString2(0,s.Length+1) & tempC & TAB
        If tempI3+1<tempA.Length Then
            tempD=tempD & tempA.SubString (tempI3+1)
        End If
        tempA=tempD
    End If    
    ar(Asc(s.CharAt (0))-65,Asc(s.CharAt(1))-64,Asc(s.CharAt (2))-64)=tempA
Next
' here we would close the loop
For o1=0 To 25
    For o2=0 To 26
        For o3=0 To 26
            If ar(o1,o2,o3).Length >0 Then
                Dim tempObj() As String 
                tempObj=Regex.Split (TAB,ar(o1,o2,o3))
                For omega=0 To tempObj.Length -1
                    finalList.AddSingleLine (tempObj(omega))
                Next
            End If
        Next
    Next
Next
End Sub
Who knows, maybe it will not crash :sign0095:
 
Last edited:
Upvote 0

thedesolatesoul

Expert
Licensed User
Longtime User
I know the slowest conceivable way to do it via TextReader probably.

Start reading the file, on the first 'not seen before' word, go through the remaining file looking for occurences of this word, and write the count to a variable or a file (accessed via TextReader/Writer again).

TextReader/Writer can solve issues running out of memory, but it will be slow. Then optimizations can further be made by writing out a temporary file, removing already counted words to make the input file smaller and smaller.
 
Upvote 0

thedesolatesoul

Expert
Licensed User
Longtime User
By reading through my count textfile(where I store all my counts) using TextReader. I will have to re-read on each new word i encounter. This file will hopefully be a lot smaller, but ofcourse if the input contains only unique words then the count outfile will be the same size as the input :eek:
 
Upvote 0

mc73

Well-Known Member
Licensed User
Longtime User
I think I can easily say its the slowest crash-free solution :)

the crash-free was what I had in mind, when posted the code above, while at the same time, avoiding searching in a probable big list everytime, by decaring a 3d array, which of course could be 4d or arbitrary :)
 
Upvote 0

Erel

B4X founder
Staff member
Licensed User
Longtime User
The problem is that we cannot treat all words at the same time. So we need to somehow divide the words to smaller groups. Each word must be part of exactly one group.

The way I solve it is by calculating the hash of each word modulo <number of groups>.
The hash function returns a number from the string. The hash function is expected to distribute the strings more or less evenly. This means that the size of each group will be more or less equal.

This is the most important code (there are total of 30 groups):
B4X:
If h Mod 30 = Group Then
 'handle word

I assume that more than 10mb of words can be handled in memory so I chose the number of groups to be 30 (the file size is 300mb).

B4X:
Sub Globals
   Dim words As Map
End Sub

Sub Activity_Create(FirstTime As Boolean)
   For i = 0 To 29
      Dim tr As TextReader = OpenFile
      CountWords(i, tr)
      tr.Close
      PrintMap
      words.Clear 'without clearing this map we will get an OutOfMemory exception
   Next
End Sub

Sub OpenFile As TextReader
   Dim In As InputStream
   In = File.OpenInput(...)
   Dim tr As TextReader
   tr.Initialize(In)
   Return tr
End Sub

Sub PrintMap
   For i = 0 To words.Size - 1
      Log(words.GetKeyAt(i) & ":" & words.GetValueAt(i))
   Next
End Sub

Sub CountWords(Group As Int, tr As TextReader)
   Dim word As String = tr.ReadLine
   Do While word <> Null
      Dim h As Int = HashCode(word)
      If h Mod 30 = Group Then '<-- Only treat words that belong to the current group
         Dim count As Int = 0
         If words.ContainsKey(word) Then count = words.Get(word)
         words.Put(word, count + 1)
      End If
      word = tr.ReadLine
   Loop
End Sub

Sub HashCode(s As String) As Int
   Dim h As Int = 0
   For i = 0 To s.Length - 1
      h = 31 * h + Asc(s.CharAt(i))
   Next
   Return Abs(h)
End Sub
 
Upvote 0
Top