Letter Frequency
-
- Livecode Opensource Backer
- Posts: 9376
- Joined: Fri Feb 19, 2010 10:17 am
- Location: Bulgaria
Letter Frequency
The first step in cryptanalysis, if you know the language the original text was written in, is to analyse a text for letter frequency.
Improved stack available lower down the page!
Obviously it is a relatively simple thing to adapt this stack for anyone whose language uses either
an alphabet or an abugida.
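For anyone without the stack to hand, here is a minimal sketch of the same idea in Python. It is not the stack's script, just an illustration of counting letters and ranking them; the function name letter_frequency and the sample sentence are made up for the example. One assumption worth flagging: str.isalpha() accepts letters from any alphabet, but the combining vowel signs of an abugida are not classed as letters, so they would need separate handling.

```python
from collections import Counter

def letter_frequency(text):
    """Return (letter, count, percent) tuples, most frequent first."""
    # Keep alphabetic characters only and fold case, so 'E' and 'e' count together.
    letters = [ch.upper() for ch in text if ch.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values())
    if total == 0:
        return []
    # Sort by descending count, then alphabetically so ties come out in a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(letter, n, 100.0 * n / total) for letter, n in ranked]

# Hypothetical sample text, only here to show the call.
sample = "The quick brown fox jumps over the lazy dog"
for letter, n, pct in letter_frequency(sample)[:5]:
    print(f"{letter} {n} {pct:.1f}%")
```

Percentages are relative to letters only, so spaces, digits and punctuation do not dilute the figures.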
Last edited by richmond62 on Tue Mar 06, 2018 12:10 pm, edited 2 times in total.
-
- VIP Livecode Opensource Backer
- Posts: 9833
- Joined: Sat Apr 08, 2006 7:05 am
- Location: Los Angeles
- Contact:
Re: Letter Frequency
When I was a kid I was taught that the letters most commonly used in English were, in order: E, T, A, O, N, S, H, I, with a long tail from there. We used to crack Caesar ciphers with that. Comforting to see that the screenshot of that sample kinda bears that out.
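A rough sketch of how that trick can be automated, in Python. It tries all 26 shifts and keeps the one whose output contains the most letters from the E, T, A, O, N, S, H, I group. The names shift_text and crack_caesar are invented for this example, the scoring rule is one simple choice among several, and it will likely only be reliable on texts of a sentence or more.

```python
import string

# The eight letters quoted above, treated here as an unordered set.
COMMON = set("ETAONSHI")

def shift_text(text, shift):
    """Shift A-Z and a-z forward by `shift` places, leaving other characters alone."""
    out = []
    for ch in text:
        if ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - 65 + shift) % 26 + 65))
        elif ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - 97 + shift) % 26 + 97))
        else:
            out.append(ch)
    return "".join(out)

def crack_caesar(ciphertext):
    """Return (key, plaintext) for the shift whose output is richest in common letters."""
    def score(candidate):
        return sum(1 for ch in candidate.upper() if ch in COMMON)
    # Undo each possible key in turn and keep the best-scoring result.
    best_key = max(range(26), key=lambda k: score(shift_text(ciphertext, -k)))
    return best_key, shift_text(ciphertext, -best_key)

# Hypothetical usage: encrypt a sentence with key 3, then recover it.
secret = shift_text("The first step is to analyse a text for letter frequency", 3)
key, recovered = crack_caesar(secret)
print(key, recovered)
```

Counting membership in a set ignores the order of the letters, which is why the disagreement over ETAON versus ETAOI further down the thread would not change the outcome much.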
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn
-
- VIP Livecode Opensource Backer
- Posts: 9655
- Joined: Wed May 06, 2009 2:28 pm
- Location: New York, NY
Re: Letter Frequency
Whatever happened to "E T A O I N S H R D L U"?
Or was this just the way early linotype machines dumped their letters?
Craig
-
- Livecode Opensource Backer
- Posts: 9376
- Joined: Fri Feb 19, 2010 10:17 am
- Location: Bulgaria
Re: Letter Frequency
I have used a relatively short text.
I suspect to get an accurate reflection of letter frequency in any language you'd
have to analyse quite a few long texts of differing genres.
-
- VIP Livecode Opensource Backer
- Posts: 9833
- Joined: Sat Apr 08, 2006 7:05 am
- Location: Los Angeles
- Contact:
Re: Letter Frequency
Craig, it may be that the book I was learning from was very old. Language changes over time, and maybe that affects letter frequency. Or maybe I'm just so old I forgot the details. Either way, we were both close enough to the results of that sample that we'd be able to sort our way through a Caesar cipher version of it.
For a more accurate count maybe we can talk Richmond into running the algo against the Enron corpus:
https://www.cs.cmu.edu/~enron/
The tar file is only 443 MB - shouldn't take too long.
Richard Gaskin
-
- Livecode Opensource Backer
- Posts: 9376
- Joined: Fri Feb 19, 2010 10:17 am
- Location: Bulgaria
Re: Letter Frequency
"we can talk Richmond into running the algo against the Enron corpus"
Quite.
Of course you could download my stack and do that yourself.
Re: Letter Frequency
My (ancient) codes and ciphers book said ETAONRISH...
I guess letters fall out of popularity like names.
I would guess the content of the Enron file might be skewed towards business terminology and could (probably only very slightly) affect the letter count compared to the language as a whole. Plus U would be at a disadvantage as it would be left out of words like colour... Oh no! English isn't English any more!
-
- Livecode Opensource Backer
- Posts: 9376
- Joined: Fri Feb 19, 2010 10:17 am
- Location: Bulgaria
Re: Letter Frequency
Analysing the text of a book written in the 1890s about Linguistics I came across "nifty" words
such as 'ni**er', 'd*rk*e', 'peasant', 'primitive' and 'uncivilised' . . .
While those words may be a bit dangerous to use because the Thought Police and the Politically Correct Lefty meddlers might "drum you out of the Br*wnies", analysing samples that contain words of that source STILL will
yield an idea of letter-frequency in 19th century English written texts.
As a person who has "chromatically challenged" hair (it's Ginger), I have absolutely no time for "tip-toeing through the tulips" when it comes to language: a 'spade' is a 'spade' (despite possible ambiguities) and NOT
an 'earth relocation device'.
NOW: go away and design a LiveCode stack to check people's texts for Politically Incorrect words and phrases, including an in-built updater so it is always ahead of the moving fence.
-
- Livecode Opensource Backer
- Posts: 9376
- Joined: Fri Feb 19, 2010 10:17 am
- Location: Bulgaria
Re: Letter Frequency
Just added the above as a Love Gift for those who want to muck around with large corpora.
Completely OT: my Book of the Moment is 'The Book of Bebb' which I thoroughly recommend.
-
- Livecode Opensource Backer
- Posts: 9376
- Joined: Fri Feb 19, 2010 10:17 am
- Location: Bulgaria
Re: Letter Frequency
Funny remarks about the ENRON dataset aside, this:
"It contains data from about 150 users, mostly senior management of Enron, organized into folders."
Means that, having downloaded the dataset one would have to concatenate
all the constituent messages into one long text . . .
. . . a bit of a pain.
Although, from the point of view of LiveCode that wouldn't be a problem as such.
What would be a problem would be "jumping in and out" of all the folders to load the
text files into a text field in a stack.
-
- Livecode Opensource Backer
- Posts: 9376
- Joined: Fri Feb 19, 2010 10:17 am
- Location: Bulgaria
Re: Letter Frequency
Frankly, it might be better to work with corpora such as these:
http://www.natcorp.ox.ac.uk/
https://corpus.byu.edu/
although in most of the cases they list these are NOT straightforward files containing texts;
they have all sorts of "guff" such as POS-tags embedded in them.
Yikes!
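The tags need not be fatal, though. A very small Python sketch of stripping them before counting, under a loud assumption: it supposes each token is written as word_TAG (for example "The_AT0 cat_NN1"), which is only one of several layouts tagged corpora use and may not match the files from either site. The function name strip_pos_tags is made up for the example.

```python
def strip_pos_tags(tagged_text):
    """Remove a trailing _TAG from each whitespace-separated token.

    Assumes the hypothetical word_TAG layout; tokens without an
    underscore are passed through unchanged.
    """
    words = []
    for token in tagged_text.split():
        # rpartition splits at the LAST underscore, so "well_known_AJ0" keeps "well_known".
        word, separator, _tag = token.rpartition("_")
        words.append(word if separator else token)
    return " ".join(words)

print(strip_pos_tags("The_AT0 cat_NN1 sat_VVD"))
```

An XML-tagged corpus would need a proper XML parser instead of string splitting.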
-
- VIP Livecode Opensource Backer
- Posts: 9833
- Joined: Sat Apr 08, 2006 7:05 am
- Location: Los Angeles
- Contact:
Re: Letter Frequency
richmond62 wrote: Tue Mar 06, 2018 12:36 pm
Funny remarks about the ENRON dataset aside, this:
"It contains data from about 150 users, mostly senior management of Enron, organized into folders."
Means that, having downloaded the dataset one would have to concatenate
all the constituent messages into one long text .
Given the size of the corpus it would be far more efficient to just traverse the folders and process each individually.
Besides, being email, unless you were doing a relationship study you'd probably want to write a filter to remove the header from each.
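To make that concrete, here is a hedged Python sketch of the traverse-and-filter approach: walk the folder tree, strip each message's header, and add the body's letters to one running tally, so nothing is ever concatenated. The function names are invented for the example, and it assumes each file is a single plain-text message whose header ends at the first blank line, which I have not checked against the actual Enron files.

```python
import os
from collections import Counter

def strip_header(message):
    """Drop everything up to the first blank line, which ends an email header block."""
    normalized = message.replace("\r\n", "\n")
    _header, separator, body = normalized.partition("\n\n")
    # If no blank line is found, keep the whole text rather than discard it.
    return body if separator else normalized

def count_letters_in_tree(root):
    """Walk `root`, adding each file's body letters to one running Counter."""
    totals = Counter()
    for folder, _subfolders, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(folder, name)
            # errors="replace" keeps odd bytes from aborting a long run.
            with open(path, encoding="utf-8", errors="replace") as handle:
                body = strip_header(handle.read())
            totals.update(ch.upper() for ch in body if ch.isalpha())
    return totals
```

Only one message is held in memory at a time, so the size of the corpus affects the running time but not the memory needed. Quoted replies and forwarded headers inside a body would still be counted.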
Richard Gaskin
-
- Livecode Opensource Backer
- Posts: 9376
- Joined: Fri Feb 19, 2010 10:17 am
- Location: Bulgaria
Re: Letter Frequency
"a relationship study"
Err . . . I'm currently working on getting Bulgarian kids between 6 and 8 years old to write
English in a vaguely comprehensible hand.