---------------------
Objective
---------------------
deliver a C (and optional 2nd language) program that - from a large list
of unsorted words possibly containing duplicates - extracts 26 sets of
100 random and unique words that each begin with a letter of the English alphabet.
On 22/03/2026 14:38, DFS wrote:
---------------------
Objective
---------------------
deliver a C (and optional 2nd language) program that - from a large list
of unsorted words possibly containing duplicates - extracts 26 sets of
100 random and unique words that each begin with a letter of the English
alphabet.
What random distribution, uniform?
Said distribution over the unique words or said distribution over the original list?
pseudorandom?
On 22/03/2026 14:38, DFS wrote:
---------------------
Objective
---------------------
deliver a C (and optional 2nd language) program that - from a large list
of unsorted words possibly containing duplicates - extracts 26 sets of
100 random and unique words that each begin with a letter of the English
alphabet.
By "extracts" do you mean to imply that instances of words are selected
with removal from the population rather than being returned for the
following selection events?
3) print the 2600 words you identify in column x row order in a grid of
You must call a RNG 2600+ times to build the list
ie you can't use the
random ordering of the input file to your advantage).
On 22/03/2026 14:38, DFS wrote:
You must call a RNG 2600+ times to build the list
ie you can't use the
random ordering of the input file to your advantage).
The two are not the same, that is, the use of "ie" is wrong.
Which do you really require, or do you really require I satisfy the conjunction of the two?
On 3/22/2026 7:02 PM, Tristan Wibberley wrote:
On 22/03/2026 14:38, DFS wrote:
---------------------
Objective
---------------------
deliver a C (and optional 2nd language) program that - from a large list
of unsorted words possibly containing duplicates - extracts 26 sets of
100 random and unique words that each begin with a letter of the English
alphabet.
What random distribution, uniform?
Said distribution over the unique words or said distribution over the
original list?
pseudorandom?
I don't care about the uniformity of the distribution, as long as the
output is unique words, and you generate and use 2600+ random values
from a RNG.
On 3/22/2026 7:21 PM, Tristan Wibberley wrote:
On 22/03/2026 14:38, DFS wrote:
You must call a RNG 2600+ times to build the list
ie you can't use the
random ordering of the input file to your advantage).
The two are not the same, that is, the use of "ie" is wrong.
I never said they were the same.
On 3/22/2026 1:29 PM, John McCue wrote:
DFS <nospam@dfs.com> wrote:
<snip>
---------------------
Word Source
---------------------
There's a huge unsorted word list here:
https://limewire.com/?referrer=pq7i8xx7p2
...which you can develop against.
Do I need to create an ID to get the list ?
I don't think so.
It didn't give me an ID or login when I uploaded them.
I just now uploaded it here: https://filebin.net/kkkyqw1ritefnw0f
On 22/03/2026 14:38, DFS wrote:
---------------------
Objective
---------------------
deliver a C (and optional 2nd language) program that - from a large
list of unsorted words possibly containing duplicates - extracts 26
sets of 100 random and unique words that each begin with a letter of
the English alphabet.
I've had a first go at this challenge, using a scripting language to get something working as a reference.
Not C, so that code is here: https://github.com/sal55/langs/blob/master/dfs.q
The output it produced is this: https://github.com/sal55/langs/blob/master/output. (I think it is
missing the heading for challenge 3.) It took 0.35 seconds to write that file.
I haven't looked at your version in detail but did notice the
line-counts (as I had to delete those lines for a previous reply).
Any solution I come up with in C (which may take a while!) will
use entirely different methods. I'm not interested in writing
hash-tables etc in C, I'm far too lazy. Probably it will be much longer
than yours.
One thing which is still not clear is how to choose the layout of the
final challenge. I assume the number of rows has to be a multiple of
100, but how to decide the columns?
I went with 3 columns max as the most practical.
(Probably my version will go wrong if there aren't at least 100 words
per letter in the input.)
On 22/03/2026 23:14, DFS wrote:
On 3/22/2026 7:02 PM, Tristan Wibberley wrote:
On 22/03/2026 14:38, DFS wrote:
---------------------
Objective
---------------------
deliver a C (and optional 2nd language) program that - from a large list
of unsorted words possibly containing duplicates - extracts 26 sets of
100 random and unique words that each begin with a letter of the English
alphabet.
What random distribution, uniform?
Said distribution over the unique words or said distribution over the
original list?
pseudorandom?
I don't care about the uniformity of the distribution, as long as the
output is unique words, and you generate and use 2600+ random values
from a RNG.
I think you're unaware that I can predictably generate a sequence of identical values when the distribution is free and your specification is satisfied by selecting with a distribution that prefers just one
indicatory value for a choice of word to the exclusion of all others.
You mention an RNG, I suppose then that you exclude pseudo-random
numbers because those are normally referred to as PRNGs and I understand
that RNG excludes them.
On 22/03/2026 14:38, DFS wrote:
You must call a RNG 2600+ times to build the list
ie you can't use the
random ordering of the input file to your advantage).
The two are not the same, that is, the use of "ie" is wrong.
Which do you really require, or do you really require I satisfy the conjunction of the two?
Do you try to hint that challenges with seemingly arbitrary rules and seemingly arbitrary purposes are not very worthy?
On 3/22/2026 7:53 PM, Bart wrote:
Not C, so that code is here: https://github.com/sal55/langs/blob/
master/dfs.q
Slick. It's a powerful scripting language. Reading a text file in with
one line is nice. It's about 10 lines of C.
Did you look to python for inspiration when creating it?
Looks like line 16 is where you call a randomizer. If you put a counter
at line 17 what does it say after the program is run?
Is bounds a property of your list objects?
Is bounds a pair of numbers 0..length of list-1?
What generates your random values?
Any solution I come up with in C (which may take a while!) will
use entirely different methods. I'm not interested in writing hash-
tables etc in C, I'm far too lazy. Probably it will be much longer
than yours.
You have to deliver C to get a chance at the prize.
On 3/22/2026 7:53 PM, Bart wrote:
I haven't looked at your version in detail but did notice the
line-counts (as I had to delete those lines for a previous reply).
Any solution I come up with in C (which may take a while!) will
use entirely different methods. I'm not interested in writing hash-
tables etc in C, I'm far too lazy. Probably it will be much longer
than yours.
You have to deliver C to get a chance at the prize.
And I like to see different approaches. The way I did it in C and
Python is similar, but Python makes it SO easy (one-line) to segregate
words by letter that I took the easy way out there.
DFS <nospam@dfs.com> writes:
On 3/22/2026 1:29 PM, John McCue wrote:
DFS <nospam@dfs.com> wrote:
<snip>
---------------------
Word Source
---------------------
There's a huge unsorted word list here:
https://limewire.com/?referrer=pq7i8xx7p2
...which you can develop against.
Do I need to create an ID to get the list ?
I don't think so.
It didn't give me an ID or login when I uploaded them.
I just now uploaded it here: https://filebin.net/kkkyqw1ritefnw0f
A fucking web page. How about a link to a plain text file
that has just the words?
On 3/22/2026 8:05 PM, Tristan Wibberley wrote:
On 22/03/2026 23:14, DFS wrote:
On 3/22/2026 7:02 PM, Tristan Wibberley wrote:
On 22/03/2026 14:38, DFS wrote:
---------------------
Objective
---------------------
deliver a C (and optional 2nd language) program that - from a large list
of unsorted words possibly containing duplicates - extracts 26 sets of
100 random and unique words that each begin with a letter of the English
alphabet.
What random distribution, uniform?
Said distribution over the unique words or said distribution over the
original list?
pseudorandom?
I don't care about the uniformity of the distribution, as long as the
output is unique words, and you generate and use 2600+ random values
from a RNG.
I think you're unaware that I can predictably generate a sequence of
identical values when the distribution is free and your specification is
satisfied by selecting with a distribution that prefers just one
indicatory value for a choice of word to the exclusion of all others.
Yeah, I don't really know what any of that means. But it sounds like
your 3rd attempt to sidestep the generation and use of 2600+ random values.
I think you could show some interesting techniques, but you have to
adhere to the requirements of the challenge.
On 3/23/2026 4:53 AM, Michael S wrote:
Do you try to hint that challenges with seemingly arbitrary rules and
seemingly arbitrary purposes are not very worthy?
Arbitrary and worth are in the eyes of the beholder.
So keep your arbitrary, worthless opinions to yourself.
On 23/03/2026 04:06, DFS wrote:
On 3/22/2026 8:05 PM, Tristan Wibberley wrote:
On 22/03/2026 23:14, DFS wrote:
On 3/22/2026 7:02 PM, Tristan Wibberley wrote:
On 22/03/2026 14:38, DFS wrote:
---------------------
Objective
---------------------
deliver a C (and optional 2nd language) program that - from a large list
of unsorted words possibly containing duplicates - extracts 26 sets of
100 random and unique words that each begin with a letter of the English
alphabet.
What random distribution, uniform?
Said distribution over the unique words or said distribution over the
original list?
pseudorandom?
I don't care about the uniformity of the distribution, as long as the
output is unique words, and you generate and use 2600+ random values
from a RNG.
I think you're unaware that I can predictably generate a sequence of
identical values when the distribution is free and your specification is
satisfied by selecting with a distribution that prefers just one
indicatory value for a choice of word to the exclusion of all others.
Yeah, I don't really know what any of that means. But it sounds like
your 3rd attempt to sidestep the generation and use of 2600+ random values.
I think you could show some interesting techniques, but you have to
adhere to the requirements of the challenge.
It's because of my deeper understanding of the meaning (or barely meaningfulness) of the word "random" and my awareness of how critical it
is to many applications of randomness.
That is:
- If it's a game I could go ahead with a PRNG and satisfy you easily -
but it's not interesting to me these days, at this point I think it's a
game,
- If it's a secure application of choice that leaks /no/ information
about the input list beyond the fact of the achievement of the
lower-bound, respectively, on the number of words having each initial, I
can tighten your specification in some of the ways I've queried,
- If it's a secure application of choice that may leak some information about the input list but may not leak any of the information of the
original ordering of the words (which was implied to be the reason for
the minimum number of queries for random numbers) then I can use fewer
bits of entropy, saving runtime costs. This 3rd possible endeavour is
less relevant now that you allow a PRNG because I don't have to care
about the cost of bits of entropy or their turnaround time. It's still
an interesting one when considering the nature of the task of
requirements engineering and agreeing requirements. Programmes can fail
due to uncompetitiveness induced by individual member projects with unnecessary or insufficient requirements.
More than that, though, queries for random numbers may come in
individual bits and an implementation might query 16 numbers, for
example, for each word choice, rather than one. And it goes deeper than
that. That means the request to query for 2600 numbers is sort of
meaningless and can lead to programme failure by being in the class of unnecessary itself and its presence leading to the other requirements
being insufficient.
More still, people get gambling games wrong and go to jail because they
learn and practice C programming without any awareness of the difficulty
of "random" and they might read this newsgroup to shape their skills.
So you see, each of my questions was properly important, for many reasons.
I really did think about it carefully.
On 3/24/2026 3:43 AM, Tim Rentsch wrote:
DFS <nospam@dfs.com> writes:
On 3/22/2026 1:29 PM, John McCue wrote:
DFS <nospam@dfs.com> wrote:
<snip>
---------------------
Word Source
---------------------
There's a huge unsorted word list here:
https://limewire.com/?referrer=pq7i8xx7p2
...which you can develop against.
Do I need to create an ID to get the list ?
I don't think so.
It didn't give me an ID or login when I uploaded them.
I just now uploaded it here: https://filebin.net/kkkyqw1ritefnw0f
A fucking web page. How about a link to a plain text file
that has just the words?
Just fucking click on the fucking file name.
DFS <nospam@dfs.com> writes:
On 3/24/2026 3:43 AM, Tim Rentsch wrote:
DFS <nospam@dfs.com> writes:
On 3/22/2026 1:29 PM, John McCue wrote:
DFS <nospam@dfs.com> wrote:
<snip>
---------------------
Word Source
---------------------
There's a huge unsorted word list here:
https://limewire.com/?referrer=pq7i8xx7p2
...which you can develop against.
Do I need to create an ID to get the list ?
I don't think so.
It didn't give me an ID or login when I uploaded them.
I just now uploaded it here: https://filebin.net/kkkyqw1ritefnw0f
A fucking web page. How about a link to a plain text file
that has just the words?
Just fucking click on the fucking file name.
Not everyone reads usenet with a browser or a news client
that understands hypertext or the hypertext transfer protocol.
I would generally have used 'wget' to fetch, so if you'd
specified:
https://filebin.net/kkkyqw1ritefnw0f/words_unsorted.txt
That may have been slightly better, but it appears that
filebin interposes a warning screen and forces a second
click, so wget may also have failed.
On 23/03/2026 03:40, DFS wrote:
On 3/22/2026 7:53 PM, Bart wrote:
I haven't looked at your version in detail but did notice the
line-counts (as I had to delete those lines for a previous reply).
Any solution I come up with in C (which may take a while!) will
use entirely different methods. I'm not interested in writing hash-
tables etc in C, I'm far too lazy. Probably it will be much longer
than yours.
You have to deliver C to get a chance at the prize.
And I like to see different approaches.˙ The way I did it in C and
Python is similar, but Python makes it SO easy (one-line) to segregate
words by letter that I took the easy way out there.
I now have a C version, a bit long to post, so it's at this link:
https://github.com/sal55/langs/blob/master/dfs.c
It looks very clunky but seems to do the job, and not too slowly either
(see below).
I then tried yours, which is somewhat shorter (160 lines vs my 205
lines, which includes blanks etc).
However, that doesn't seem to do part (2) of the challenge. While that
doesn't explicitly say the unsorted duplicates must be shown, that's what
the example does:
  found:  eventually dupes you get
  output: Dupes Eventually Get You
Your C program (I see the Python does it too) shows the equivalent of this:
  Duplicate words in proper case
  Dupes Eventually Get You
Now, I noticed that my original M version displayed that first 'found'
line, but the words were sorted, not unsorted! Displaying the original
order involved quite a bit of extra work, and an extra copy of the
word-list. The method is also inefficient.
So, is that necessary, or not? If not then I can simplify my versions.
Anyway, my C version does absolutely nothing clever. Everything is a
linear search.
The only hi-tech bit is the quicksort routine.
Timing, all run under Windows:
  My C:         0.30 seconds
  Your C:       0.25 seconds
  My Q:         0.34 seconds
  Your Python:  1.77 seconds (CPython)
                0.88 seconds (PyPy)
The C timings are unoptimised; optimising might knock off 0.01 or 0.02 seconds.
I don't know why the Python timing is slow, especially given that its
sort() routine will be an internal native-code function, and mine runs
as bytecode.
My interpreters generally are faster than CPython at executing bytecode,
but with tasks like this, most time is usually spent within internal
native code libraries.
On 23/03/2026 03:40, DFS wrote:
On 3/22/2026 7:53 PM, Bart wrote:
Not C, so that code is here: https://github.com/sal55/langs/blob/
master/dfs.q
Slick. It's a powerful scripting language. Reading a text file in
with one line is nice. It's about 10 lines of C.
Well, it can be one line in C too, once you create a function for it!
Did you look to python for inspiration when creating it?
No. I glanced at it but all I remember is that it was 58 lines.
Looks like line 16 is where you call a randomizer. If you put a
counter at line 17 what does it say after the program is run?
It's called 2631 times. With a different seed, it will vary.
Is bounds a property of your list objects?
Is bounds a pair of numbers 0..length of list-1?
Yes, but the bounds usually start from 1. And here, the 'long' and
'short' lists have bounds of 'a' to 'z' (97 to 122).
What generates your random values?
I use the PRNG shown below (not C code, and not mine).
There are a couple of levels of functions on top. The range-based
'random()' in the scripting language probably gives slightly biased
results, but none of my stuff including this is critical.
Any solution I come up with in C (which may take a while!) will
use entirely different methods. I'm not interested in writing hash-
tables etc in C, I'm far too lazy. Probably it will be much longer
than yours.
You have to deliver C to get a chance at the prize.
I decided to do it in my 'M' language first as there are fewer i's and
t's to dot and cross when developing an algorithm.
That part's been done, now all that remains is manual porting to C. I
will do that later. (Auto-transpiling to C works, but I guess that's not
the kind of C you want.)
(If interested, my version is here; it's about 160 lines: https://github.com/sal55/langs/blob/master/dfs.m. I had planned to use
C's qsort(), but that didn't seem to work, so it includes a sort routine.)
This version produces the output in 0.30 seconds.
BTW the challenge has proved useful as it showed up bugs in both my scripting language and the compiled one. The first has been fixed, the second will be; I used the previous compiler version to test the code.
---------------------
[2]int seed = (0x2989'8811'1111'1272',0x1673'2673'7335'8264)
export func mrandom:u64 =
    int x, y
    x := seed[1]
    y := seed[2]
    seed[1] := y
    x ixor:= x<<23
    seed[2] := x ixor y ixor x>>17 ixor y>>26
    return seed[2] + y
end
On 3/23/2026 12:03 PM, Bart wrote:
On 23/03/2026 03:40, DFS wrote:
On 3/22/2026 7:53 PM, Bart wrote:
Not C, so that code is here: https://github.com/sal55/langs/blob/
master/dfs.q
Slick. It's a powerful scripting language. Reading a text file in
with one line is nice. It's about 10 lines of C.
Well, it can be one line in C too, once you create a function for it!
Did you look to python for inspiration when creating it?
No. I glanced at it but all I remember is that it was 58 lines.
I don't mean my little bit of code. I mean: did you look to the Python
language for inspiration when you were developing your scripting language?
I decided to do it in my 'M' language first as there are fewer i's and
t's to dot and cross when developing an algorithm.
You have separate M and Q languages?
(If interested, my version is here; it's about 160 lines: https://
github.com/sal55/langs/blob/master/dfs.m. I had planned to use C's
qsort(), but that didn't seem to work, so it includes a sort routine.)
This version produces the output in 0.30 seconds.
Why wouldn't qsort() work?
---------------------
[2]int seed = (0x2989'8811'1111'1272',0x1673'2673'7335'8264)
export func mrandom:u64 =
    int x, y
    x := seed[1]
    y := seed[2]
    seed[1] := y
    x ixor:= x<<23
    seed[2] := x ixor y ixor x>>17 ixor y>>26
    return seed[2] + y
end
Do you have a C version of that?
If so I'll run it against a RNG comparison program I wrote.
On 3/24/2026 10:02 AM, Scott Lurndal wrote:
DFS <nospam@dfs.com> writes:
On 3/24/2026 3:43 AM, Tim Rentsch wrote:
DFS <nospam@dfs.com> writes:
On 3/22/2026 1:29 PM, John McCue wrote:
DFS <nospam@dfs.com> wrote:
<snip>
---------------------
Word Source
---------------------
There's a huge unsorted word list here:
https://limewire.com/?referrer=pq7i8xx7p2
...which you can develop against.
Do I need to create an ID to get the list ?
I don't think so.
It didn't give me an ID or login when I uploaded them.
I just now uploaded it here: https://filebin.net/kkkyqw1ritefnw0f
A fucking web page. How about a link to a plain text file
that has just the words?
Just fucking click on the fucking file name.
Not everyone reads usenet with a browser or a news client
that understands hypertext or the hypertext transfer protocol.
Sucks for them.
I would generally have used 'wget' to fetch, so if you'd
specified:
https://filebin.net/kkkyqw1ritefnw0f/words_unsorted.txt
That may have been slightly better, but it appears that
filebin interposes a warning screen and forces a second
click, so wget may also have failed.
$ wget -r -np https://filebin.net/kkkyqw1ritefnw0f/words_unsorted.txt
On 3/24/2026 7:03 AM, Tristan Wibberley wrote:
On 23/03/2026 04:06, DFS wrote:
On 3/22/2026 8:05 PM, Tristan Wibberley wrote:
On 22/03/2026 23:14, DFS wrote:
On 3/22/2026 7:02 PM, Tristan Wibberley wrote:
On 22/03/2026 14:38, DFS wrote:
---------------------
Objective
---------------------
deliver a C (and optional 2nd language) program that - from a large list
of unsorted words possibly containing duplicates - extracts 26 sets of
100 random and unique words that each begin with a letter of the English
alphabet.
What random distribution, uniform?
Said distribution over the unique words or said distribution over the
original list?
pseudorandom?
I don't care about the uniformity of the distribution, as long as the
output is unique words, and you generate and use 2600+ random values
from a RNG.
I think you're unaware that I can predictably generate a sequence of
identical values when the distribution is free and your
specification is
satisfied by selecting with a distribution that prefers just one
indicatory value for a choice of word to the exclusion of all others.
Yeah, I don't really know what any of that means. But it sounds like
your 3rd attempt to sidestep the generation and use of 2600+ random
values.
I think you could show some interesting techniques, but you have to
adhere to the requirements of the challenge.
It's because of my deeper understanding of the meaning (or barely
meaningfulness) of the word "random" and my awareness of how critical it
is to many applications of randomness.
That is:
- If it's a game I could go ahead with a PRNG and satisfy you easily -
but it's not interesting to me these days, at this point I think it's a
game,
A game... now you're onto me.
- If it's a secure application of choice that leaks /no/ information
about the input list beyond the fact of the achievement of the
lower-bound, respectively, on the number of words having each initial, I
can tighten your specification in some of the ways I've queried,
I sense your "tightening" will result in an unreadable spec, but it
would be fun to try. So let's have it.
I would agree that on a scale of necessary to unnecessary, this
challenge lies very close to unnecessary.
But it lies closer to the middle of the scale interesting..uninteresting.
I have a few more up my sleeve. One in particular I've been thinking
about, that explicitly disallows the use of a RNG.
More still, people get gambling games wrong and go to jail because they
learn and practice C programming without any awareness of the difficulty
of "random" and they might read this newsgroup to shape their skills.
You should consult an attorney - I wouldn't want you to do (rand() % 26)
+ 1 days in jail for reading clc and attempting my challenge.
ps you're nuts
In comp.lang.c, DFS <nospam@dfs.com> wrote:
I just now uploaded it here: https://filebin.net/kkkyqw1ritefnw0f
"word" list
$ grep ^.$ ~/tmp/words-unsorted |grep -v [aeiouy] |wc
19 19 38
$ grep ^..$ ~/tmp/words-unsorted |grep -v [aeiouy] |wc
54 54 162
$ grep ^...$ ~/tmp/words-unsorted |grep -v [aeiouy] |wc
74 74 296
$ grep ^....$ ~/tmp/words-unsorted |grep -v [aeiouy] |wc
13 13 65
Elijah
------
not going to use that for Scrabble
On 3/23/2026 6:26 PM, Bart wrote:
It looks very clunky but seems to do the job, and not too slowly
either (see below).
Years ago I was shocked how fast C chewed thru text data (and it's even faster dealing with numbers).
Actually, I'm still shocked. I wrote an anagram program in C that used
prime factors to do searches, and it found 5 anagrams from a list of
370K words in 0.0055s (5.5/1000ths of a second).
And it would be even faster with the use of a hash table. Incredible.
And that's on my low-end AMD Ryzen 5600G (16GB DDR4-3200 RAM)
No extra copy of the list is necessary to find duplicates (but for
one-pass efficiency, sorting the list is required).
Look at the first letter of each duplicate.
"congratulations on the wherewithal youngun"
cotwy
Sort the file and the dupes are already sorted. That was intentional.
If that explanation lets you drop some lines, good.
My method of finding the 26 sets was to:
1) count words by letter as the file is read in
   lettercnt[wordsin[i][0]-'a']++;
(I saw something similar in your scripting, but couldn't spot it in
your .c)
2) sort the data just read in
qsort(wordsin, wordcnt, sizeof(char*), comparechar);
3) using the lettercnt[] array from step 1, determine the start-end positions of each set of words beginning with a..z
Letter   Start     End
a            0   20484
b        20485   34475
c        34476   60069
d        60070   75050
e        75051   86572
f        86573   95977
g        95978  104985
h       104986  116490
i       116491  127653
j       127654  129796
k       129797  132749
l       132750  140949
m       140950  157658
n       157659  166088
o       166089  175859
p       175860  205604
q       205605  207078
r       207079  221162
s       221163  253678
t       253679  269769
u       269770  287936
v       287937  292502
w       292503  297884
x       297885  298330
y       298331  299249
z       299250  300397
4) generate 100+ random numbers between start and end of each letter
int r = (rand() % (end - start + 1)) + start;
This 'calculation of start and end' for each letter is what I thought to
be a novel approach.
I'm curious how others will approach it (if anyone else tries).
Altogether my program makes:
3 passes thru the 300398 words in:
  * 1 to count total words and words by letter
  * 1 to load the words into an array
  * 1 to find duplicates
2 passes thru the 2600 words out:
  * 1 to verify the 100 words per letter
  * 1 to print all 2600 words
5 total passes?  Not sure that's Ivy League.  But everything runs in
1/10th of a second so I can't complain.
If you have a short challenge of medium difficulty, post it so we can
learn and improve skillz.
---------------------
Objective
---------------------
deliver a C (and optional 2nd language) program that - from a large list
of unsorted words possibly containing duplicates - extracts 26 sets of
100 random and unique words that each begin with a letter of the English alphabet.
---------------------
Outputs
---------------------
1) count of words by letter
Letter   Words In   Words Out
a            2345        100
b            4399        100
c             844        100
...
z            1011        100
2) identify duplicate words in the input file (if any) and print
   them sorted and using proper case on one line.
   found:  eventually dupes you get
   output: Dupes Eventually Get You
3) print the 2600 words you identify in column x row order in a grid of
   size (200rows x 13cols or 300x9 or 400x7 or 500x6 or 600x4 etc)
   without hard-coding each column in a long printf.  They must be in
   alpha order.  If you participated in the 'sort of trivial challenge'
   a few weeks ago, you'll recognize this requirement.
2600 unique random words (1000 rows x 3 columns)
   1.  aardwolves          kafirin             uberous
   2.  abaze               kafiz               ulnae
   3.  abitibi             kala                ulnare
  ...
 599.  funned              pyrone              zymophosphate
 600.  fusan               pythiacystis        zymotic
 601.  gable               qanat
 602.  gade                qere
  ...
 998.  juvia               typedefs
 999.  juxtaposition       tyrannizings
1000.  jynx                tzaddikim
---------------------
Requirement
---------------------
You must call a RNG 2600+ times to build the list (i.e., you can't use the
random ordering of the input file to your advantage).  In repeated runs,
my C and python solutions called the RNG 2635x to 2675x (because of
duplicate randoms).
---------------------
Word Source
---------------------
There's a huge unsorted word list here:
https://limewire.com/?referrer=pq7i8xx7p2
...which you can develop against.
My C and python solutions are shown below, and at the same link.
No code perusal until you submit yours!
Enjoy!
========================================================================
C  125 LOC
On my WSL system this C runs in 0.095 seconds using the unsorted words file
========================================================================
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <ctype.h>  //for tolower and toupper

//example usage = $./2600words words_unsorted.txt 500 6

//string compare function for qsort
int comparechar(const void *a, const void *b) {
    const char **chara = (const char **)a;
    const char **charb = (const char **)b;
    return strcmp(*chara, *charb);
}

int main(int argc, char *argv[]) {
    //validations
    if (argc < 4) {
        printf("Invalid input \nEnter program-name  word-file  rows  columns\n");
        printf("example: $./2600words words.txt 400 7\n\n");
        exit(0);
    }
    if (atoi(argv[2]) * atoi(argv[3]) < 2600) {
        printf("Invalid input: enter rows * columns that total 2600+ \n\n");
        exit(0);
    }

    int  i = 0, t = 0, wout = 0;        //counters
    int  lettercnt[26] = {0};           //hold count of words by first letter
    int  maxwordlen = 0;                //length of longest word in list
    int  start = 0, end = 0;            //used to extract 100 words per letter
    int  temp[100] = {0};               //holds the 100 random indexes for the letter
    int  wordcnt = 0, totwords = 0;     //used to extract 100 words per letter
    char line[35] = "";                 //buffer to hold line when reading file
    char therand[12];                   //the current random value, as text
    char usedlist[2000] = "";           //stores the random numbers already used
                                        //(was 1000: 200 entries of up to 8 chars can overflow that)

    // ===========================================================================
    //nitty gritty - read in the unsorted words
    // ===========================================================================
    FILE *fin = fopen(argv[1], "r");                        //open file
    while (fgets(line, sizeof line, fin) != NULL) {         //count lines = words, get max word length
        wordcnt++;
        if ((int)strlen(line) > maxwordlen) {
            maxwordlen = strlen(line);
        }
    }
    char theword[maxwordlen + 1];
    rewind(fin);                                            //pointer back to beginning
    char **wordsin = malloc(sizeof(char*) * wordcnt);       //allocate memory
    while (fgets(theword, sizeof theword, fin) != NULL) {   //read line into buffer
        int wordlen = strlen(theword);                      //get length of word
        wordsin[i] = malloc(wordlen + 1);                   //allocate memory for the word
        strncpy(wordsin[i], theword, wordlen);              //copy word into array
        wordsin[i][wordlen-1] = '\0';                       //add terminator - overwrites the \n in the file
        lettercnt[wordsin[i][0]-'a']++;                     //update count of words by first letter
        i++;                                                //increment counter
    }
    fclose(fin);                                            //close handle to file

    // ===========================================================================
    //fun begins
    // ===========================================================================
    //sort master list of words
    //for each letter, determine the start and end positions of words beginning with that letter
    //generate random numbers between that start and end
    //check if that random number is in the usedlist array.  If not, add it to usedlist and temp arrays
    //when temp array has 100 unique randoms in it, add them to the master array, break and go to next letter
    //do one sort at the end
    qsort(wordsin, wordcnt, sizeof(char*), comparechar);    //sort the master
    char **wordsout = malloc(sizeof(char*) * 2600);         //final output goes into this array
    srand(time(NULL));
    for (i = 0; i < 26; i++) {                              //find start-end for each letter set
        start = (totwords += lettercnt[i]) - lettercnt[i];
        end = start + lettercnt[i] - 1;
        memset(usedlist, 0, sizeof(usedlist));
        memset(temp,     0, sizeof(temp));                  //was 100: cleared only a quarter of the int array
        t = 0;
        for (int j = 0; j < 200; j++) {
            int r = (rand() % (end - start + 1)) + start;
            sprintf(therand, " %d ", r);
            if (strstr(usedlist, therand) == NULL) {
                strncat(usedlist, therand, strlen(therand));
                temp[t++] = r;
                if (t == 100) {                             //was t > 100: that wrote temp[100] out of bounds
                    for (int k = 0; k < 100; k++) {
                        sprintf(theword, "%s", wordsin[temp[k]]);
                        int wordlen = strlen(theword);
                        wordsout[wout] = malloc(wordlen + 1);
                        strncpy(wordsout[wout], theword, wordlen);
                        wordsout[wout++][wordlen] = '\0';
                    }
                    break;
                }
            }
        }
    }
    qsort(wordsout, wout, sizeof(char*), comparechar);      //final sort of 2600 words

    // ===================================================================================================
    //final output: print word counts by letter, print dupes, print random words by column then row
    // ===================================================================================================
    printf("%d words loaded\n", wordcnt);
    if (wout == 2600) {
        printf("list of 2600 unique random words created\n");
        printf("\nLetter   Words In   Words Out\n");
        for (i = 0; i < 26; i++) {
            t = 0;
            for (int j = 0; j < wout; j++) {
                if (wordsout[j][0] == (i + 'a')) {t++;}
            }
            printf("  %2c    %6d       %d\n", i + 'a', lettercnt[i], t);
        }
    } else {
        printf("Errors occurred.  2600 words not produced.\n");
        exit(0);
    }

    //duplicate words
    printf("\nDuplicate words in proper case\n");
    for (i = 0; i < wordcnt-1; i++) {
        if (strcmp(wordsin[i], wordsin[i+1]) == 0) {
            sprintf(theword, "%s", wordsin[i]);
            for (int k = 0; theword[k] != '\0'; k++) {
                theword[k] = (k == 0) ? toupper(theword[k]) : tolower(theword[k]);
            }
            printf("%s ", theword);
        }
    }

    //print random words in column then row order
    int rows = atoi(argv[2]), cols = atoi(argv[3]), colwidth = 20;
    printf("\n\n2600 unique random words (%d rows x %d columns)\n", rows, cols);
    for (int r = 1; r <= rows; r++) {
        if (r <= wout) {
            int nbr = r;
            printf("%3d. %-*s", r, colwidth, wordsout[nbr-1]);
            for (int c = 0; c < cols-1; c++) {
                nbr += rows;
                if (nbr <= wout) {
                    printf("%-*s", colwidth, wordsout[nbr-1]);
                }
            }
            printf("\n");
        }
    }

    //finito - free the individual words, then the pointer arrays
    for (i = 0; i < wordcnt; i++) free(wordsin[i]);
    for (i = 0; i < wout; i++)    free(wordsout[i]);
    free(wordsin);
    free(wordsout);
    return 0;
}
========================================================================
========================================================================
python  58 LOC
On my WSL system this python runs in 1.05 seconds using the unsorted words file
========================================================================
import sys, random

if len(sys.argv) < 4:
    print("Invalid input \nEnter program name word-file rows columns")
    print("example: $ python3  2600words.py  words.txt  400  7")
    exit()
if (int(sys.argv[2]) * int(sys.argv[3])) < 2600:
    print("Invalid input: enter rows * columns that total 2600+")
    exit()

#read unsorted words file, generate 100 randoms per letter
from collections import Counter
wordsout, used, temp = [], [], []
lettercnt = [0]*26
with open(sys.argv[1], 'r') as f:
    wordsin = f.readlines()
    for line in wordsin:
        lettercnt[ord(line[0]) - 97] += 1
    print("%d words loaded" % (len(wordsin)))
    wordsuni = sorted(set(wordsin))
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        used.clear()
        temp.clear()
        lwords = [line for line in wordsuni if line[0] == letter]
        lenwordset = len(lwords)
        for i in range(200):
            randword = lwords[random.randint(0, lenwordset - 1)]
            if randword not in used:
                temp.append(randword.rstrip())
                used.append(randword)
                if len(temp) == 100:
                    wordsout += sorted(temp)
                    break
print('list of ' + str(len(wordsout)) + ' unique random words created')

#words out should always be 100 per letter
print("\nLetter   Words In   Words Out")
for i in range(26):
    wout = 0
    for j in range(len(wordsout)):
        if ord(wordsout[j][0]) == (i + 97):
            wout += 1
    print("  %2c    %6d       %d" % (i + 97, lettercnt[i], wout))

#find duplicate words
print("\nDuplicate words in proper case")
counts = Counter(wordsin)
dupes = [item for item, count in counts.items() if count > 1]
if len(dupes) > 0:
    for dupe in sorted(dupes):
        print(dupe.strip().title(), end=' ')
else:
    print("no duplicate words")

#print randoms by col then row
rows, cols = int(sys.argv[2]), int(sys.argv[3])
colwidth, words = 20, len(wordsout)
print("\n\n2600 unique random words (%d rows x %d columns)" % (rows, cols))
for r in range(1, rows + 1):
    if r <= words:
        nbr = r
        print("%3d. %-*s" % (nbr, colwidth, wordsout[nbr-1]), end=' ')
        for i in range(cols - 1):
            nbr += rows
            if nbr <= words:
                print("%-*s" % (colwidth, wordsout[nbr-1]), end=' ')
        print()
====================================================================
On 24/03/2026 17:06, DFS wrote:
On 3/23/2026 6:26 PM, Bart wrote:
1) count words by letter as the file is read in
lettercnt[wordsin[i][0]-'a']++;
(I saw something similar in your scripting, but couldn't spot it in
your .c)
It's probably this line:
˙˙˙˙˙˙˙˙˙ ++nbig[(unsigned char)buffer[0]]
The cast is because 'char' is signed and could be negative.
Note that my arrays can have arbitrary lower bounds (this is a rare
feature among HLLs), and here start from 'a'.
In shell ...
----
#!/bin/ksh
sort words_unsorted.txt >words.srt
uniq words.srt >words.unq
cat <<-X
    There are $(wc -l words.srt |\
        cut -d\  -f1) words in words_unsorted.txt
    and $(wc -l words.unq | cut -d\  -f1) are unique.
    Duplicates are: $(diff words.srt words.unq |\
        grep "<" | cut -d\  -f2 | sed "s/^\(.\)/\u\1/g" |\
        sort | tr '\n' ' ')
    Counts:
    $(
        for X in {a..z}
        do
            echo -n "$X " ; grep -c ^$X words.unq
        done
    )
    Samples ...
X
for X in {a..z}
do
    grep ^$X words.unq | shuf -n 100 | sed "s/^\(.\)/\u\1/g"
done >words.tmp
head -1000 words.tmp >words.0
head -2000 words.tmp | tail -1000 >words.1
tail -600 words.tmp > words.2
paste words.0 words.1 words.2 | nl -w4 -s". "
rm words.srt words.unq words.0 words.1 words.2
return 0
On 24/03/2026 17:17, DFS wrote:
On 3/24/2026 7:03 AM, Tristan Wibberley wrote:
On 23/03/2026 04:06, DFS wrote:
On 3/22/2026 8:05 PM, Tristan Wibberley wrote:
On 22/03/2026 23:14, DFS wrote:
On 3/22/2026 7:02 PM, Tristan Wibberley wrote:
On 22/03/2026 14:38, DFS wrote:
---------------------
Objective
---------------------
deliver a C (and optional 2nd language) program that - from a large list
of unsorted words possibly containing duplicates - extracts 26 sets of
100 random and unique words that each begin with a letter of the English
alphabet.
What random distribution, uniform?
Said distribution over the unique words or said distribution over the
original list?
pseudorandom?
I don't care about the uniformity of the distribution, as long as the
output is unique words, and you generate and use 2600+ random values
from a RNG.
I think you're unaware that I can predictably generate a sequence of
identical values when the distribution is free and your specification is
satisfied by selecting with a distribution that prefers just one
indicatory value for a choice of word to the exclusion of all others.
Yeah, I don't really know what any of that means.  But it sounds like
your 3rd attempt to sidestep the generation and use of 2600+ random
values.
I think you could show some interesting techniques, but you have to
adhere to the requirements of the challenge.
It's because of my deeper understanding of the meaning (or barely
meaningfulness) of the word "random" and my awareness of how critical it
is to many applications of randomness.
That is:
 - If it's a game I could go ahead with a PRNG and satisfy you easily -
but it's not interesting to me these days, at this point I think it's a
game,
A game... now you're onto me.
 - If it's a secure application of choice that leaks /no/ information
about the input list beyond the fact of the achievement of the
lower-bound, respectively, on the number of words having each initial, I
can tighten your specification in some of the ways I've queried,
I sense your "tightening" will result in an unreadable spec, but it
would be fun to try.  So let's have it.
I don't think it would be unreadable. It might require some thought to synthesise a program that satisfies it.
But you've told me it's just a game (I suppose "toy", rather than
gambling game). So the interesting bit could be just like:
"produce the output so that, within each of the groupings based on the initial letter, the words are superficially shuffled around even when
they're not shuffled around in the input."
challenge lies very close to unnecessary.
I didn't mean to suggest the challenge is unnecessary, but mean to
discuss the problems of writing specifications and requirements such
that they're not necessary to fulfil the goal. Requirements involving randomness and secrecy are particularly interesting in that respect.
But it lies closer to the middle of the scale interesting..uninteresting.
I have a few more up my sleeve.  One in particular I've been thinking
about, that explicitly disallows the use of a RNG.
More still, people get gambling games wrong and go to jail because they
learn and practice C programming without any awareness of the difficulty
of "random" and they might read this newsgroup to shape their skills.
You should consult an attorney - I wouldn't want you to do (rand() % 26)
+ 1 days in jail for reading clc and attempting my challenge.
You seem to be assuming every reader is just fulfilling a need for a pastime. I don't suppose that, and you seem to be mocking me for it; I
think that's awful. You'd got me really excited about the breadth of
nuance in requirements and the effects of that and then turned it into
an opportunity for mockery.
ps you're nuts
That may be, but how did you know?
On 3/25/2026 8:32 AM, Richard Harnden wrote:
In shell ...
----
#!/bin/ksh
[shell script snipped]
29 lines.  Speed is good.  Very nice.  Probably took you no more than an hour to write.
How do I run it?  I tried this in Windows Subsystem for Linux:
$ sudo ksh harnden.sh
: not found2]:
uniq: words.srt: No such file or directory
: not found5]:
: not found6]:
wc: words.srt: No such file or directory
Unfortunately...
* the output doesn't meet the requirements: the words are unique and
  sorted, but they're not each randomly chosen by a RNG().
* you hard-coded your output logic: 3 groups of words in 3 columns.
* you didn't offer a C solution.
On 26/03/2026 03:21, DFS wrote:
On 3/25/2026 8:32 AM, Richard Harnden wrote:
In shell ...
----
#!/bin/ksh
[shell script snipped]
29 lines.  Speed is good.  Very nice.  Probably took you no more than
an hour to write.
How do I run it?  I tried this in Windows Subsystem for Linux:
$ sudo ksh harnden.sh
: not found2]:
uniq: words.srt: No such file or directory
: not found5]:
: not found6]:
wc: words.srt: No such file or directory
Make sure that words_unsorted.txt is in the same directory,
that harnden.sh is executable,
then: ./harnden.sh
No need for sudo.
Unfortunately...
* the output doesn't meet the requirements: the words are unique and
  sorted, but they're not each randomly chosen by a RNG().
shuf(1) will call rand(3)
* you hard-coded your output logic: 3 groups of words in 3 columns.
True, but it satisfies your "2600 unique random words (1000 rows x 3 columns)"
* you didn't offer a C solution.
No, I wanted to see if a shell solution was 'good enough'.
On 3/26/2026 12:02 AM, Richard Harnden wrote:
On 26/03/2026 03:21, DFS wrote:
On 3/25/2026 8:32 AM, Richard Harnden wrote:
In shell ...
----
#!/bin/ksh
[shell script snipped]
29 lines.  Speed is good.  Very nice.  Probably took you no more than
an hour to write.
How do I run it?  I tried this in Windows Subsystem for Linux:
$ sudo ksh harnden.sh
: not found2]:
uniq: words.srt: No such file or directory
: not found5]:
: not found6]:
wc: words.srt: No such file or directory
Make sure that words_unsorted.txt is in the same directory,
that harnden.sh is executable,
then: ./harnden.sh
No need for sudo.
$ ksh ./harnden.sh: not foundh[2]:
: cannot create [Permission denied]
: cannot create [Permission denied]
: not foundh[5]:
wc: words.srt: No such file or directory
: not foundh[6]:
words_unsorted.txt is in the directory
After that runs:
 words.srt is created (and the words are sorted)
 words.unq is empty
On 3/24/2026 8:29 PM, Tristan Wibberley wrote:
On 24/03/2026 17:17, DFS wrote:
On 3/24/2026 7:03 AM, Tristan Wibberley wrote:
On 23/03/2026 04:06, DFS wrote:
On 3/22/2026 8:05 PM, Tristan Wibberley wrote:
On 22/03/2026 23:14, DFS wrote:
On 3/22/2026 7:02 PM, Tristan Wibberley wrote:
On 22/03/2026 14:38, DFS wrote:
---------------------
Objective
---------------------
deliver a C (and optional 2nd language) program that - from a large list
of unsorted words possibly containing duplicates - extracts 26 sets of
100 random and unique words that each begin with a letter of the English
alphabet.
What random distribution, uniform?
Said distribution over the unique words or said distribution over the
original list?
pseudorandom?
I don't care about the uniformity of the distribution, as long as the
output is unique words, and you generate and use 2600+ random values
from a RNG.
I think you're unaware that I can predictably generate a sequence of
identical values when the distribution is free and your specification is
satisfied by selecting with a distribution that prefers just one
indicatory value for a choice of word to the exclusion of all others.
Yeah, I don't really know what any of that means.  But it sounds like
your 3rd attempt to sidestep the generation and use of 2600+ random
values.
I think you could show some interesting techniques, but you have to
adhere to the requirements of the challenge.
It's because of my deeper understanding of the meaning (or barely
meaningfulness) of the word "random" and my awareness of how critical it
is to many applications of randomness.
That is:
 - If it's a game I could go ahead with a PRNG and satisfy you easily -
but it's not interesting to me these days, at this point I think it's a
game,
A game... now you're onto me.
 - If it's a secure application of choice that leaks /no/ information
about the input list beyond the fact of the achievement of the
lower-bound, respectively, on the number of words having each initial, I
can tighten your specification in some of the ways I've queried,
I sense your "tightening" will result in an unreadable spec, but it
would be fun to try.  So let's have it.
I don't think it would be unreadable. It might require some thought to
synthesise a program that satisfies it.
But you've told me it's just a game (I suppose "toy", rather than
gambling game). So the interesting bit could be just like:
"produce the output so that, within each of the groupings based on the
initial letter, the words are superficially shuffled around even when
they're not shuffled around in the input."
So a sorted input list is loaded, then the groupings by letter are
shuffled "superficially".
What constitutes a superficial shuffle?
I don't know, it's you that said the random distribution wasn't important.
On 24/03/2026 17:06, DFS wrote:
On 3/23/2026 6:26 PM, Bart wrote:
It looks very clunky but seems to do the job, and not too slowly
either (see below).
Years ago I was shocked how fast C chewed thru text data (and it's
even faster dealing with numbers).
Actually, I'm still shocked.  I wrote an anagram program in C that
used prime factors to do searches, and it found 5 anagrams from a list
of 370K words in 0.0055s (5.5/1000ths of a second).
You're attributing too much to C. Or maybe comparing it too much to
Python which is very slow.
There are other factors: hardware today is incredibly fast (like 4 orders of magnitude or more faster than the 8-bit machines I started off on).
And a lot of it is due to the optimising compilers now available.
My own systems language is also quite low-level. And can be just as fast
if someone were to write an optimising compiler for it too!
(As it is, it's not far off. Its self-hosted compiler can build over 20
new generations of itself per second, on a machine slower than yours.)
And it would be even faster with the use of a hash table.  Incredible.
And that's on my low-end AMD Ryzen 5600G (16GB DDR4-3200 RAM)
If that's low-end, what would be high-end? I mean in desktop computer
terms not some supercomputer.
No extra copy of the list is necessary to find duplicates (but for one-pass efficiency, sorting the list is required).
Look at the first letter of each duplicate.
"congratulations on the wherewithal youngun"
cotwy
Sort the file and the dupes are already sorted.  That was intentional.
If that explanation lets you drop some lines, good.
I'm now down to 150 sloc for the C version, and 125 sloc for the M version.
My method of finding the 26 sets was to:
1) count words by letter as the file is read in
lettercnt[wordsin[i][0]-'a']++;
(I saw something similar in your scripting, but couldn't spot it in
your .c)
It's probably this line:
˙˙˙˙˙˙˙˙˙ ++nbig[(unsigned char)buffer[0]]
The cast is because 'char' is signed and could be negative.
Note that my arrays can have arbitrary lower bounds (this is a rare
feature among HLLs), and here start from 'a'.
2) sort the data just read in
qsort(wordsin, wordcnt, sizeof(char*), comparechar);
3) using the lettercnt[] array from step 1, determine the start-end
positions of each set of words beginning with a..z
Letter   Start     End
a            0   20484
b        20485   34475
c        34476   60069
d        60070   75050
e        75051   86572
f        86573   95977
g        95978  104985
h       104986  116490
i       116491  127653
j       127654  129796
k       129797  132749
l       132750  140949
m       140950  157658
n       157659  166088
o       166089  175859
p       175860  205604
q       205605  207078
r       207079  221162
s       221163  253678
t       253679  269769
u       269770  287936
v       287937  292502
w       292503  297884
x       297885  298330
y       298331  299249
z       299250  300397
4) generate 100+ random numbers between start and end of each letter
int r = (rand() % (end - start + 1)) + start;
This 'calculation of start and end' for each letter is what I thought
to be a novel approach.
I'm curious how others will approach it (if anyone else tries).
I don't understand what's going on there.
If there are N words in total that start with 'c', say, then I just
generate a random number from 0 to N-1 (C), or 1 to N (M).
Altogether my program makes:
3 passes thru the 300398 words in:
  * 1 to count total words and words by letter
  * 1 to load the words into an array
  * 1 to find duplicates
2 passes thru the 2600 words out:
  * 1 to verify the 100 words per letter
  * 1 to print all 2600 words
5 total passes?  Not sure that's Ivy League.  But everything runs in
1/10th of a second so I can't complain.
In 0.25 seconds on my machine! This is why it can be better to not use the fastest machine around: then you can spot inefficiencies more easily.
That was Windows; on WSL it was a little slower: 0.4 seconds 'real' time.
If you have a short challenge of medium difficulty, post it so we can
learn and improve skillz.
I tried the same program on an unsorted list 10 times the size. That is, just duplicating everything to get a 3,003,980-line file.
Generally programs still worked, but took longer, and the list of
duplicates was a bit bigger!
The Python version took 4.2s or 5s on PyPy. My Q version got much slower
at 14s (maybe the interpreted sort is the reason).
Your C version was 4.5s. Mine are 3.x but they cap the duplicates at
100 so they can't be compared.
On 3/25/2026 8:32 AM, Richard Harnden wrote:
In shell ...
----
#!/bin/ksh
sort words_unsorted.txt >words.srt
uniq words.srt >words.unq
cat <<-X
˙˙˙ There are $(wc -l words.srt |\
˙˙˙˙˙˙˙ cut -d\˙ -f1) words in words_unsorted.txt
˙˙˙ and $(wc -l words.unq | cut -d\˙ -f1) are unique.
˙˙˙ Duplicates are: $(diff words.srt words.unq |\
˙˙˙˙˙˙˙ grep "<" | cut -d\˙ -f2 | sed "s/^\(.\)/\u\1/g" |\
˙˙˙˙˙˙˙ sort | tr '\n' ' ')
˙˙˙ Counts:
˙˙˙ $(
˙˙˙˙˙˙˙ for X in {a..z}
˙˙˙˙˙˙˙ do
˙˙˙˙˙˙˙˙˙˙˙ echo -n "$X " ; grep -c ^$X words.unq
˙˙˙˙˙˙˙ done
˙˙˙ )
˙˙˙ Samples ...
X
for X in {a..z}
do
˙˙˙ grep ^$X words.unq | shuf -n 100 | sed "s/^\(.\)/\u\1/g"
done >words.tmp
head -1000 words.tmp >words.0
head -2000 words.tmp | tail -1000 >words.1
tail -600 words.tmp > words.2
paste words.0 words.1 words.2 | nl -w4 -s". "
rm words.srt words.unq words.0 words.1 words.2
return 0
29 lines. Speed is good. Very nice. Probably took you no more than an hour to write.
How do I run it? I tried this in Windows Subsystem for Linux:
$ sudo ksh harnden.sh
: not found2]:
uniq: words.srt: No such file or directory
: not found5]:
: not found6]:
wc: words.srt: No such file or directory
Unfortunately...
* the output doesn't meet the requirements: the words are unique and
sorted, but they're not each randomly chosen by a RNG().
it can be better not to use the fastest machine around; then you can spot inefficiencies more easily.
On 3/25/2026 7:54 AM, Bart wrote:
You should be set forever.
I haven't priced out a full system in a long time, and RAM prices have surged the last 6 months, but you can probably still get a smokin' fast tower computer with a low-end-but-plenty-fast-enough video card for
$2500 to $3000.
Research, order and build it yourself to save $500+.
No extra copy of the list is necessary to find duplicates (but for one-pass efficiency, sorting the list is required).
Look at the first letter of each duplicate.
"congratulations on the wherewithal youngun"
cotwy
Sort the file and the dupes are already sorted. That was intentional.
If that explanation lets you drop some lines, good.
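The first-letter trick above can be reproduced from the shell (a toy illustration of the idea, not anyone's submitted solution):

```shell
# Take the first letter of each word, in order; with a pre-sorted
# duplicates list, the letters come out already sorted too.
echo "congratulations on the wherewithal youngun" |
    tr ' ' '\n' | cut -c1 | tr -d '\n'
# prints: cotwy
```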
I'm now down to 150 sloc for the C version, and 125 sloc for the M
version.
I'm at 146 (but if the dupes were unsorted would need a few more)
Note that my arrays can have arbitrary lower bounds (this is a rare
feature among HLLS),
Sounds dangerous.
I don't understand what's going on there.
sorted array
--------------------------------------
             Position in Array
Letter  WordCnt    Start      End
--------------------------------------
a         20485        0    20484
b         13991    20485    34475
c         25594    34476    60069
d         14981    60070    75050
e         11522    75051    86572
f          9405    86573    95977
g          9008    95978   104985
h         11505   104986   116490
i         11163   116491   127653
j          2143   127654   129796
k          2953   129797   132749
l          8200   132750   140949
m         16709   140950   157658
n          8430   157659   166088
o          9771   166089   175859
p         29745   175860   205604
q          1474   205605   207078
r         14084   207079   221162
s         32516   221163   253678
t         16091   253679   269769
u         18167   269770   287936
v          4566   287937   292502
w          5382   292503   297884
x           446   297885   298330
y           919   298331   299249
z          1148   299250   300397
--------------------------------------
So the start-end values become the range of randoms generated for that letter.
int r = (rand() % (end - start + 1)) + start;
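The same per-letter draw can be mimicked from the shell: `shuf -i` (a GNU coreutils assumption) picks uniformly from an inclusive range, so the start/end pair from the table maps over directly. The values below are the 'c' row from the table; this is a sketch, not the thread's C program:

```shell
start=34476; end=60069            # the 'c' row from the table above
r=$(shuf -i "$start-$end" -n 1)   # uniform random index into the sorted array
echo "$r"
```

Unlike `rand() % n`, `shuf` also sidesteps modulo bias, though at these range sizes the bias is negligible anyway.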
If there are N words in total that start with 'c', say, then I just
generate a random number from 0 to N-1 (C), or 1 to N (M).
How do you address a word at position 99999 in the sorted list by using
0 or 1?
Altogether my program makes:
3 passes thru the 300398 words in:
  * 1 to count total words and words by letter
  * 1 to load the words into an array
  * 1 to find duplicates
2 passes thru the 2600 words out:
  * 1 to verify the 100 words per letter
  * 1 to print all 2600 words
5 total passes? Not sure that's Ivy League. But everything runs in
1/10th of a second so I can't complain.
In 0.25 seconds on my machine! This is why it can be better not to use
the fastest machine around; then you can spot inefficiencies more easily.
I just added internal timing code to the C program:
1) loaded 300398 words in                    0.028 seconds
2) created 26 sets of 100 unique words in    0.067 seconds
3) printed counts of words by letter in      0.000 seconds
4) identified and printed duplicate words in 0.003 seconds
5) printed 2600 words in                     0.002 seconds
6) total run time is                         0.101 seconds
Your C version was 4.5s. Mine are 3.x but they cap the duplicates at
100 so they can't be compared.
I did something wrong - the words output has 2 first letters. Can you
spot where I messed up?
for x in {a..z}
do
grep ^$x words.uniq | shuf -n 100 | sed "s/^\(.\)/\L&\1/g"
done > words.temp
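For the curious: in GNU sed, `&` re-inserts the entire match into the replacement, so `\L&\1` emits the matched first letter twice (`&` once, then the captured `\1` again). A minimal demonstration of the bug alongside the `\u\1` form used earlier in the thread (GNU sed assumed, since `\u`/`\L` are GNU extensions):

```shell
echo "apple" | sed "s/^\(.\)/\L&\1/"   # buggy: & + \1 doubles the letter -> "aapple"
echo "apple" | sed "s/^\(.\)/\u\1/"    # fixed: capitalizes -> "Apple"
```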
On Sun, 22 Mar 2026 23:21:43 +0000
Tristan Wibberley <tristan.wibberley+netnews2@alumni.manchester.ac.uk>
wrote:
On 22/03/2026 14:38, DFS wrote:
You must call a RNG 2600+ times to build the list
ie you can't use the
random ordering of the input file to your advantage).
The two are not the same, that is, the use of "ie" is wrong.
Which do you really require, or do you really require I satisfy the
conjunction of the two?
Do you try to hint that challenges with seemingly arbitrary rules and seemingly arbitrary purposes are not very worthy?
If yes, then you could as well say it directly.
Personally, I think the proposed challenge has some interesting
parts. Unfortunately, other parts are dumb or pointless or
needlessly tedious, which is disappointing.
On 22/03/2026 14:38, DFS wrote:
---------------------
Objective
---------------------
deliver a C (and optional 2nd language) program that - from a large
list of unsorted words possibly containing duplicates - extracts 26
sets of 100 random and unique words that each begin with a letter of
the English alphabet.
On 25/03/2026 12:32, Richard Harnden wrote:
On 22/03/2026 14:38, DFS wrote:
---------------------
Objective
---------------------
deliver a C (and optional 2nd language) program that - from a large
list of unsorted words possibly containing duplicates - extracts 26
sets of 100 random and unique words that each begin with a letter of
the English alphabet.
Here's my C attempt.
146 lines, but I like my vertical whitespace.
On 3/27/2026 1:24 PM, Richard Harnden wrote:
On 25/03/2026 12:32, Richard Harnden wrote:
On 22/03/2026 14:38, DFS wrote:
---------------------
Objective
---------------------
deliver a C (and optional 2nd language) program that - from a large
list of unsorted words possibly containing duplicates - extracts 26
sets of 100 random and unique words that each begin with a letter of
the English alphabet.
Here's my C attempt.
146 lines, but I like my vertical whitespace.
Thanks for the submission.
It's 106 lines of code, so it's the shortest yet.
The only part you didn't get quite right was:
"print the 2600 words you identify in column x row order in a grid of
˙size (200rows x 13cols or 300x9 or 400x7 or 500x6 or 600x4 etc) "
On 2026-03-28 05:52, DFS wrote:
On 3/27/2026 1:24 PM, Richard Harnden wrote:
On 25/03/2026 12:32, Richard Harnden wrote:
On 22/03/2026 14:38, DFS wrote:
---------------------
Objective
---------------------
deliver a C (and optional 2nd language) program that - from a large
list of unsorted words possibly containing duplicates - extracts 26
sets of 100 random and unique words that each begin with a letter
of the English alphabet.
Here's my C attempt.
146 lines, but I like my vertical whitespace.
Thanks for the submission.
It's 106 lines of code, so it's the shortest yet.
The only part you didn't get quite right was:
"print the 2600 words you identify in column x row order in a grid of
˙˙size (200rows x 13cols or 300x9 or 400x7 or 500x6 or 600x4 etc) "
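One way to read "column x row order" is column-major: the first 200 words fill column 1, the next 200 fill column 2, and so on. A sketch of a 200-row x 13-column grid using `split`/`paste` in the same spirit as the ksh script (the `seq` line is a stand-in for the 2600 selected words; file names are illustrative):

```shell
seq 2600 > words.tmp            # stand-in for the 2600 selected words
split -l 200 words.tmp col.     # 13 chunks of 200 lines: col.aa .. col.am
paste col.a* | nl -w4 -s". " > grid.txt
head -1 grid.txt                # row 1 holds words 1, 201, 401, ..., 2401
```

Each chunk becomes one column, so reading down a column walks the original order, which is what "column x row" seems to ask for.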
Ruminations on the recent "C" challenges...
Some requirements appear to be quite arbitrary.
But okay. When
I read about the tasks to implement the first thought that came
up was to use an appropriate language or tool-set, one that fits
better for the task, tasks that at least I consider annoying to
implement them in "C" because that language doesn't support it
well, because of C's primitivity (its low-level'ness). But okay;
we're in a C-group and the residents need feeding. - Why is it
that I consider it annoying in "C"? - Because I'd have liked to
implement such tasks based on existing _building blocks_; like
associative arrays, sensible array data types, and what not.
Instead of constructing and building a car with tools like an
axe and a stone, wouldn't it be more sensible to create useful
tools in "C" to make such challenges concentrate more on the
actual problem than on how to reinvent the simplest tasks again
and again? - I'd certainly consider it worthwhile to challenge implementations of building blocks that alleviate C-programmers
from all the boring error-prone and low-level tasks that are
celebrated ad nauseam. - The question I'd ask myself if faced
with (arbitrary or useful) requirements would be what elementary
functions I'd need to construct the solution. Such identified
and isolated features, i.e. their implementation, would have a
persistent value for more than a single arbitrary "C" challenge.
Personally, I think the proposed challenge has some interesting
parts. Unfortunately, other parts are dumb or pointless or
needlessly tedious, which is disappointing.
On 3/28/2026 1:59 AM, Janis Papanagnou wrote:
[...]
I think one word would've sufficed where you used 235: python
[...]
On 2026-03-28 14:05, DFS wrote:
On 3/28/2026 1:59 AM, Janis Papanagnou wrote:
[...]
I think one word would've sufficed where you used 235: python
Sorry, I cannot associate that statement with anything I said. -
What is that "235: python" referring to? - Mind to elaborate?
On 2026-03-28 14:05, DFS wrote:...
I think one word would've sufficed where you used 235: python
Sorry, I cannot associate that statement with anything I said. -
What is that "235: python" referring to? - Mind to elaborate?
On 2026-03-30 05:26, Janis Papanagnou wrote:
On 2026-03-28 14:05, DFS wrote:...
I think one word would've sufficed where you used 235: python
Sorry, I cannot associate that statement with anything I said. -
What is that "235: python" referring to? - Mind to elaborate?
I cannot answer your question, but the way you worded it suggests to me
that you may have parsed his comment incorrectly.
It should be parsed as
"I think one word would've sufficed where you used 235. That word is
python."
On 30/03/2026 10:26, Janis Papanagnou wrote:
On 2026-03-28 14:05, DFS wrote:
On 3/28/2026 1:59 AM, Janis Papanagnou wrote:
[...]
I think one word would've sufficed where you used 235: python
Sorry, I cannot associate that statement with anything I said. -
What is that "235: python" referring to? - Mind to elaborate?
The 235 refers to the number of words in your paragraph (I haven't
checked).
On Mon, 30 Mar 2026 12:10:46 +0100
Bart <bc@freeuk.com> wrote:
On 30/03/2026 10:26, Janis Papanagnou wrote:
On 2026-03-28 14:05, DFS wrote:
On 3/28/2026 1:59 AM, Janis Papanagnou wrote:
[...]
I think one word would've sufficed where you used 235: python
Sorry, I cannot associate that statement with anything I said. -
What is that "235: python" referring to? - Mind to elaborate?
The 235 refers to the number of words in your paragraph (I haven't
checked).
I did. There are 223 words.
So now I have a more interesting question: how did DFS come to the number
235? If by eyesight, it's impressively precise. If by use of a word
count utility, it's too imprecise.
On 31/03/2026 10:11, Michael S wrote:
On Mon, 30 Mar 2026 12:10:46 +0100
Bart <bc@freeuk.com> wrote:
On 30/03/2026 10:26, Janis Papanagnou wrote:
On 2026-03-28 14:05, DFS wrote:
On 3/28/2026 1:59 AM, Janis Papanagnou wrote:
[...]
I think one word would've sufficed where you used 235: python
Sorry, I cannot associate that statement with anything I said. -
What is that "235: python" referring to? - Mind to elaborate?
The 235 refers to the number of words in your paragraph (I haven't
checked).
I did. There are 223 words.
So, now I have more interesting question - how did DFS come to
number 235? If by eye sight - it's impressively precise. If by use
of word count utility - it's too imprecise.
OK, now I have to count them! If I use 'wc' on the original paragraph
that JP wrote, which starts like this:
Some requirements appear to be quite arbitrary. But okay. ...
Then it says 230 words. But there was also another line before that paragraph which was this:
Ruminations on the recent "C" challenges...
If that is included, then 'wc' reports 236 words. (It's also possible
that DFS mistyped the value.)
Presumably your count starts from 'But okay;'; then I get 223 words
too.
On 3/31/2026 5:11 AM, Michael S wrote:
I did. There are 223 words.
So, now I have more interesting question - how did DFS come to
number 235? If by eye sight - it's impressively precise. If by use
of word count utility - it's too imprecise.
Starting with "But okay. When", I counted on my fingers while moving
my lips. Lost count several times before I dropped it into Notepad++
and did View | Summary.
On Tue, 31 Mar 2026 13:07:48 -0400
DFS <nospam@dfs.com> wrote:
On 3/31/2026 5:11 AM, Michael S wrote:
I did. There are 223 words.
So, now I have more interesting question - how did DFS come to
number 235? If by eye sight - it's impressively precise. If by use
of word count utility - it's too imprecise.
Starting with "But okay. When", I counted on my fingers while moving
my lips. Lost count several times before I dropped it into Notepad++
and did View | Summary.
Now I know that Notepad++ has View | Summary. Thank you.
On 3/31/2026 2:15 PM, Michael S wrote:
On Tue, 31 Mar 2026 13:07:48 -0400
DFS <nospam@dfs.com> wrote:
On 3/31/2026 5:11 AM, Michael S wrote:
I did. There are 223 words.
So, now I have more interesting question - how did DFS come to
number 235? If by eye sight - it's impressively precise. If by use
of word count utility - it's too imprecise.
Starting with "But okay. When", I counted on my fingers while
moving my lips. Lost count several times before I dropped it into
Notepad++ and did View | Summary.
Now I know that Notepad++ has View | Summary. Thank you.
But if I use Notepad++ and replace every space with a \n I get 223
words. Difference of 12. Strange.
Google AI Mode says:
"Notepad++'s View | Summary (or double-clicking the status bar) is
known to produce inaccurate word counts because it uses a simplified algorithm that often misinterprets punctuation, special characters,
and encodings as word boundaries. It is widely considered "totally
wrong" for precise work.
Recommended Workarounds
For an accurate word count, use these more reliable methods:
Regex Count (Most Accurate):
Press Ctrl + F and go to the Mark or Find tab.
In Find what, type: \w+ (this matches alphanumeric word characters).
Set the Search Mode to Regular expression.
Click Count (or Mark All). The accurate word count will appear in the
status bar of that window.
Counting Selected Text Only:
To count a specific section, highlight the text and follow the Regex
Count steps above, making sure to check the In selection box.
Plugins:
NppTextFX2: This updated plugin provides a dedicated "Word Count"
tool under TextFX > TextFX Tools.
PythonScript: Advanced users can use a script (like
StatusBarWordCount) to display a live, accurate count in the status
bar.
Why "Summary" is Inaccurate
Encoding Issues: It may miscount characters in specific encodings
like UCS-2.
Word Definition: Unlike a full word processor, the Summary feature's
basic definition of a "word" often fails to handle contractions (like "don't") or hyphenated words correctly.
Hidden Spaces: It sometimes overcounts by treating multiple spaces or
line returns as extra word breaks."
Overcounting by 12 from a 223-word paragraph is ridiculously wrong.
I'm surprised, since Notepad++ is otherwise a great editor.
Note: if I use the AI suggestion of "Regex Count", it also says 235
words.
223 it is.
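The 223 vs 235 gap is consistent with differing word definitions: `wc -w` splits on whitespace only, while a `\w+` regex breaks at every apostrophe and hyphen, so a contraction counts as two "words". A quick check (GNU grep assumed, since `\w` in an ERE is a GNU extension):

```shell
echo "don't stop" | wc -w                   # whitespace words: 2
echo "don't stop" | grep -oE "\w+" | wc -l  # \w+ tokens: 3 (don, t, stop)
```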
| Sysop: | Tetrazocine |
|---|---|
| Location: | Melbourne, VIC, Australia |
| Users: | 14 |
| Nodes: | 8 (0 / 8) |
| Uptime: | 93:57:36 |
| Calls: | 211 |
| Files: | 21,502 |
| Messages: | 82,381 |