2.5 Text Segmentation (Interpunctio Verborum)
PRIMVSDIGNITASINTAMTENVISCIENTIANONPOTEST
ESSERESENIMSVNTPARVAEPROPEINSINGVLISLITTERIS
ATQVEINTERPVNCTIONIBVSVERBORVMOCCVPATAE
A fluent Latin reader would parse this string (in modern orthography) as Primus dignitas in tam tenui scientia non potest esse; res enim sunt parvae, prope in singulis litteris atque interpunctionibus verborum occupatae.
µ
Text segmentation is a problem not only in classical Latin and Greek, but also in several modern languages and scripts, including Balinese, Burmese, Chinese, Japanese, Javanese, Khmer, Lao, Thai, Tibetan, and Vietnamese. Similar problems arise in segmenting unpunctuated English text into sentences,
±·
segmenting text into lines for typesetting, speech and handwriting recognition, curve simplification, and several types of time-series analysis. For purposes of illustration, I’ll stick to segmenting sequences of letters in the modern English alphabet into modern English words.
Of course, some strings can be segmented in several different ways; for example, BOTHEARTHANDSATURNSPIN can be decomposed into English words as either BOTH·EARTH·AND·SATURN·SPIN or BOT·HEART·HANDS·AT·URNS·PIN, among many other possibilities. For now, let’s consider an extremely simple segmentation problem: Given a string of characters, can it be segmented into English words at all?
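The two decompositions above can be reproduced with a short sketch. It assumes a hypothetical toy dictionary `WORDS` containing only the words needed for this example; with a real English word list, the same string would admit many more segmentations. The function name `segmentations` is also just an illustrative choice.

```python
from typing import Iterator, List

# Hypothetical toy dictionary, just large enough for this example.
WORDS = {
    "BOT", "BOTH", "EARTH", "HEART", "AND", "HANDS",
    "AT", "SATURN", "URNS", "SPIN", "PIN",
}

def segmentations(s: str) -> Iterator[List[str]]:
    """Yield every way to split s into a sequence of words from WORDS."""
    if not s:
        # The empty string has exactly one segmentation: no words at all.
        yield []
        return
    for i in range(1, len(s) + 1):
        first = s[:i]
        if first in WORDS:
            # Each segmentation of the rest extends the chosen first word.
            for rest in segmentations(s[i:]):
                yield [first] + rest

for seg in segmentations("BOTHEARTHANDSATURNSPIN"):
    print("·".join(seg))
```

With this particular toy dictionary, the loop prints exactly the two decompositions mentioned in the text.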
To make the problem concrete (and language-agnostic), let’s assume we have access to a subroutine IsWord(w) that takes a string w as input, and returns True if w is a “word”, or False if w is not a “word”. For example, if we are trying to decompose the input string into palindromes, then a “word” is a synonym for “palindrome”, and therefore IsWord(ROTATOR) = True but IsWord(PALINDROME) = False.
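As a minimal sketch of the palindrome example, the word-testing subroutine might look like the following in Python. The name `is_word` and the choice to reject the empty string are assumptions made for illustration; the text only specifies that the subroutine takes a string and returns a truth value.

```python
def is_word(w: str) -> bool:
    """Hypothetical word test: a "word" is any nonempty palindrome."""
    return len(w) > 0 and w == w[::-1]

print(is_word("ROTATOR"))     # True
print(is_word("PALINDROME"))  # False
```

Because the rest of the discussion only calls this subroutine as a black box, swapping in a dictionary lookup (or any other test) would not change the surrounding reasoning.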
Just as in the SubsetSum problem, the input structure is a sequence, this time containing letters instead of numbers, so it is natural to consider a decision process that consumes the input characters in order from left to right. Similarly, the output structure is a sequence of words, so it is natural to consider a process that produces the output words in order from left to right. Thus, jumping into the middle of the segmentation process, we might imagine the following picture:
BLUE · STEM · UNIT · ROBOT | HEARTHANDSATURNSPIN
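One way to read this picture: the words to the left are already committed, and the next decision is which prefix of the remaining suffix HEARTHANDSATURNSPIN to emit as the next word. The sketch below lists those candidate prefixes and then tries each one recursively. The names `next_word_choices`, `splittable`, and `in_toy_dictionary`, as well as the toy dictionary itself, are hypothetical; this is plain recursion under those assumptions, with no attempt to avoid repeated work.

```python
from typing import Callable, List

# Hypothetical toy dictionary; any other word test could be substituted.
WORDS = {
    "HE", "HEAR", "HEART", "HEARTH", "AND", "HANDS",
    "AT", "SATURN", "URNS", "SPIN", "PIN",
}

def in_toy_dictionary(w: str) -> bool:
    """Stand-in word test backed by the toy dictionary above."""
    return w in WORDS

def next_word_choices(suffix: str, is_word: Callable[[str], bool]) -> List[str]:
    """Return every nonempty prefix of suffix that passes the word test."""
    return [suffix[:i] for i in range(1, len(suffix) + 1) if is_word(suffix[:i])]

def splittable(suffix: str, is_word: Callable[[str], bool]) -> bool:
    """Decide, left to right, whether suffix can be segmented into words."""
    if not suffix:
        return True  # nothing left to segment
    # Try each possible next word, then recurse on whatever remains.
    return any(
        splittable(suffix[len(w):], is_word)
        for w in next_word_choices(suffix, is_word)
    )

remaining = "HEARTHANDSATURNSPIN"
print(next_word_choices(remaining, in_toy_dictionary))
print(splittable(remaining, in_toy_dictionary))
```

Note that the recursion never looks at the words already emitted (BLUE, STEM, UNIT, ROBOT); only the remaining suffix matters for the rest of the decision.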
µ
Loosely translated: “First of all, dignity in such paltry knowledge is impossible; this is trivial stuff, mostly concerned with individual letters and the placement of points between words.” Cicero was openly mocking the legal expertise of his friend(!) and noted jurist Servius Sulpicius Rufus, who had accused Murena of bribery, after Murena defeated Rufus in the election for consul. Murena was acquitted, thanks in part to Cicero’s acerbic defense, although he was almost certainly guilty. #librapondo #nunquamestfidelis
±·
St. Augustine’s De doctrina Christiana devotes an entire chapter to removing ambiguity from Latin scripture by adding punctuation.