Letter Frequency

richmond62
Livecode Opensource Backer
Posts: 9376
Joined: Fri Feb 19, 2010 10:17 am
Location: Bulgaria

Letter Frequency

Post by richmond62 » Mon Mar 05, 2018 9:52 pm

The first step in cryptanalysis, if you know the language the original text was written in, is to analyse the text for letter frequency.
LCount.png
Improved stack available lower down the page!

Obviously it is a relatively simple thing to adapt this stack for anyone whose language uses either
an alphabet or an abugida.
Last edited by richmond62 on Tue Mar 06, 2018 12:10 pm, edited 2 times in total.
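
The counting step described in this post can be sketched in a few lines. This is a minimal illustration in Python, not the script inside the LiveCode stack (which is not shown in the thread); the function name `letter_frequency` and the sample sentence are made up for the example.

```python
from collections import Counter

def letter_frequency(text):
    """Return (letter, count) pairs, most frequent first."""
    # Fold case and keep only alphabetic characters. str.isalpha() is
    # Unicode-aware, so Cyrillic or Greek letters are counted as well.
    letters = [ch for ch in text.lower() if ch.isalpha()]
    return Counter(letters).most_common()

# Illustrative sample text (an assumption, not the text from the screenshot)
sample = "The quick brown fox jumps over the lazy dog"
for letter, count in letter_frequency(sample)[:5]:
    print(letter, count)
```

One caveat for abugidas: dependent vowel signs are combining marks rather than letters, so `isalpha()` skips them and they would need separate handling.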

FourthWorld
VIP Livecode Opensource Backer
Posts: 9833
Joined: Sat Apr 08, 2006 7:05 am
Location: Los Angeles

Re: Letter Frequency

Post by FourthWorld » Tue Mar 06, 2018 12:38 am

When I was a kid I was taught that the letters most commonly used in English were, in order: E, T, A, O, N, S, H, I, with a long tail from there. We used to crack Caesar ciphers with that. Comforting to see that the screenshot of that sample kinda bears that out.
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn
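
Cracking a Caesar cipher with a frequency list like the one above can be sketched as follows. This is a rough Python illustration under stated assumptions: it tries all 26 shifts and keeps the one whose output contains the most letters from the E, T, A, O, N, S, H, I list. The names `shift_text` and `crack_caesar` are hypothetical, and the scoring rule is one simple choice among many.

```python
import string

COMMON = "etaonshi"  # the frequency order quoted in the post above

def shift_text(text, shift):
    """Shift A-Z and a-z by `shift` places, leaving other characters alone."""
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - 97 + shift) % 26 + 97))
        elif ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - 65 + shift) % 26 + 65))
        else:
            out.append(ch)
    return "".join(out)

def crack_caesar(ciphertext):
    """Return (shift, plaintext) for the shift that looks most like English."""
    def score(candidate):
        # Count how many characters belong to the common-letter list
        lowered = candidate.lower()
        return sum(lowered.count(c) for c in COMMON)
    best = max(range(26), key=lambda s: score(shift_text(ciphertext, -s)))
    return best, shift_text(ciphertext, -best)

secret = shift_text("Meet me at the east entrance at ten tonight, then head to the station", 3)
print(crack_caesar(secret))
```

Very short messages may not contain enough common letters for this score to pick the right shift, which matches the point made later in the thread about sample size.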

dunbarx
VIP Livecode Opensource Backer
Posts: 9655
Joined: Wed May 06, 2009 2:28 pm
Location: New York, NY

Re: Letter Frequency

Post by dunbarx » Tue Mar 06, 2018 2:42 am

Whatever happened to "E T A O I N S H R D L U"?

Or was this just the way early linotype machines dumped their letters?

Craig


Re: Letter Frequency

Post by richmond62 » Tue Mar 06, 2018 7:19 am

I have used a relatively short text.

I suspect to get an accurate reflection of letter frequency in any language you'd
have to analyse quite a few long texts of differing genres.


Re: Letter Frequency

Post by FourthWorld » Tue Mar 06, 2018 7:49 am

Craig, it may be that the book I was learning from was very old. Language changes over time, and maybe that affects letter frequency. Or maybe I'm just so old I forgot the details. :) Either way, we were both close enough to the results of that sample that we'd be able to sort our way through a Caesar cipher version of it.

For a more accurate count maybe we can talk Richmond into running the algo against the Enron corpus:
https://www.cs.cmu.edu/~enron/

The tar file is only 443 MB - shouldn't take too long. ;)


Re: Letter Frequency

Post by richmond62 » Tue Mar 06, 2018 8:18 am

"we can talk Richmond into running the algo against the Enron corpus"
Quite.

Of course you could download my stack and do that yourself 8)

SparkOut
Posts: 2852
Joined: Sun Sep 23, 2007 4:58 pm

Re: Letter Frequency

Post by SparkOut » Tue Mar 06, 2018 8:19 am

My (ancient) codes and ciphers book said ETAONRISH...
I guess letters fall out of popularity like names. :D

I would guess the content of the Enron file might be skewed towards business terminology and could (probably only very slightly) affect the letter count compared to the language as a whole. Plus U would be at a disadvantage as it would be left out of words like colour... Oh no! English isn't English any more! :P


Re: Letter Frequency

Post by richmond62 » Tue Mar 06, 2018 9:34 am

Analysing the text of a book written in the 1890s about Linguistics I came across "nifty" words
such as 'ni**er', 'd*rk*e', 'peasant', 'primitive' and 'uncivilised' . . .

While those words may be a bit dangerous to use because the Thought Police and the Politically Correct Lefty meddlers might "drum you out of the Br*wnies", analysing samples that contain words of that source STILL will
yield an idea of letter-frequency in 19th century English written texts.

As a person who has "chromatically challenged" hair (it's Ginger), I have absolutely no time for "tip-toeing through the tulips" when it comes to language: a 'spade' is a 'spade' (despite possible ambiguities) and NOT
an 'earth relocation device'.

NOW: go away and design a LiveCode stack to check people's texts for Politically Incorrect words and phrases, including an in-built updater so it is always ahead of the moving fence. :twisted:


Re: Letter Frequency

Post by richmond62 » Tue Mar 06, 2018 12:11 pm

importBTN.png
Just added the above as a Love Gift for those
who want to muck around with large corpora.
Letter Counter.livecode.zip
Completely OT: my Book of the Moment is 'The Book of Bebb' which I thoroughly recommend.
Bebb.jpg


Re: Letter Frequency

Post by richmond62 » Tue Mar 06, 2018 12:36 pm

Funny remarks about the ENRON dataset aside, this:
"It contains data from about 150 users, mostly senior management of Enron, organized into folders."
means that, having downloaded the dataset, one would have to concatenate
all the constituent messages into one long text . . .

. . . a bit of a pain. 8)

Although, from the point of view of LiveCode that wouldn't be a problem as such.

What would be a problem would be "jumping in and out" of all the folders to load the
text files into a text field in a stack.


Re: Letter Frequency

Post by richmond62 » Tue Mar 06, 2018 12:45 pm

Frankly, it might be better to work with corpora such as these:

http://www.natcorp.ox.ac.uk/

https://corpus.byu.edu/

Even so, in most of the cases they list, these are NOT straightforward files containing texts;
they have all sorts of "guff" such as POS-tags embedded in them.
xml.png
Yikes!
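
If a corpus file is well-formed XML, the embedded tags can be dropped before counting. Below is a minimal Python sketch using the standard library parser. The fragment in `SAMPLE` is invented for illustration and its element and attribute names are assumptions; the real corpora use their own schemas.

```python
import xml.etree.ElementTree as ET

# Hypothetical POS-tagged fragment (not copied from any real corpus)
SAMPLE = '<s n="1"><w pos="SUBST">Letter </w><w pos="SUBST">frequency</w><c>.</c></s>'

def plain_text(xml_fragment):
    """Return only the text content, discarding tags and attributes."""
    root = ET.fromstring(xml_fragment)
    # itertext() walks the tree in document order and yields text only
    return "".join(root.itertext())

print(plain_text(SAMPLE))
```

The resulting string can then be fed to whatever letter counter is in use.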


Re: Letter Frequency

Post by FourthWorld » Tue Mar 06, 2018 3:52 pm

richmond62 wrote:
Tue Mar 06, 2018 12:36 pm
Funny remarks about the ENRON dataset aside, this:
It contains data from about 150 users, mostly senior management of Enron, organized into folders.
Means that, having downloaded the dataset one would have to concatenate
all the constituent messages into one long text .
Given the size of the corpus it would be far more efficient to just traverse the folders and process each individually.

Besides, being email, unless you were doing a relationship study you'd probably want to write a filter to remove the header from each.
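
Traversing the folders and filtering out the headers, as suggested here, might look roughly like this in Python. It is a sketch under assumptions: each message is taken to be a plain text file whose header ends at the first blank line (the usual email layout), the encoding is guessed as Latin-1, and the names `strip_header` and `count_corpus` are hypothetical.

```python
import os
from collections import Counter

def strip_header(message):
    """Drop everything up to the first blank line (the email header)."""
    _head, sep, body = message.partition("\n\n")
    # If there is no blank line, treat the whole text as body
    return body if sep else message

def count_corpus(root):
    """Walk every folder under `root` and total the letter counts."""
    totals = Counter()
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            # Encoding is an assumption; Latin-1 never fails to decode
            with open(path, encoding="latin-1") as f:
                body = strip_header(f.read())
            totals.update(ch for ch in body.lower() if ch.isalpha())
    return totals
```

Because only one message is held in memory at a time and the counts are accumulated as it goes, nothing has to be concatenated into one long text first.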


Re: Letter Frequency

Post by richmond62 » Tue Mar 06, 2018 6:03 pm

"a relationship study"
Err . . . I'm currently working on getting Bulgarian kids between 6 and 8 years old to write
English in a vaguely comprehensible hand.
RHline.png
