Hi Greg... I was going to ask you about that:
-- collecting a group of 200 values and then performing some math on them, like stdev, an average, etc.
How did you come to the chunk size of 200... or was it just a guess?
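Just to make sure we're picturing the same thing, here's a rough sketch of what I mean by per-chunk stats. The 200 is your number; the data and the function name are invented for illustration:

```python
# Rough sketch: per-chunk mean and stdev over an in-memory list.
# chunk_size=200 matches the value being discussed; everything else
# (names, sample data) is made up for this example.
import statistics

def chunked_stats(values, chunk_size=200):
    """Yield (mean, stdev) for each full chunk of `values`."""
    for start in range(0, len(values) - chunk_size + 1, chunk_size):
        chunk = values[start:start + chunk_size]
        yield statistics.mean(chunk), statistics.stdev(chunk)

data = list(range(1000))               # stand-in data
results = list(chunked_stats(data))    # 5 chunks of 200
```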
As for what I've been reading - besides coding some small virtual benchmark tests (data stored in memory, not in a file):
There is a tremendous difference in processing speed when the data comes from your CPU's L1, L2, or L3 cache compared to your motherboard's RAM.
So your latest idea to test smaller chunks would most likely help avoid loading too much of the needed data from your SSD at some point; reads are fast compared to writes.
Any process will execute faster if it can be loaded as fully as possible into your L1 cache.
Optimizing for what the L1 can handle is the best!
So one wants to avoid requesting data back from a slower component down the line as much as possible, mostly RAM or SSD.
Optimizing execution at this level requires getting to know your own CPU's capabilities and how the whole L1-L3 CPU cache system works. Taking your L1 cache size into account to "code" your processing will give you the best results you can possibly achieve on that machine.
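As a back-of-the-envelope example of what "taking your L1 size into account" can look like - note the 32 KiB figure is just a common L1 data cache size, not necessarily yours, so check your own CPU's specs:

```python
# Assumption: a 32 KiB L1 data cache (common, but verify for your CPU).
# How many 8-byte floats fit in it, i.e. a cache-sized chunk length.
L1D_BYTES = 32 * 1024
FLOAT_BYTES = 8
elements_per_l1 = L1D_BYTES // FLOAT_BYTES
print(elements_per_l1)  # 4096 values per cache-sized chunk
```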
There are probably specialized tools out there that could help you figure out more precisely how big a "chunk" would best fit your CPU/cache/RAM for the data you submit - but meanwhile, benchmarking each run, from small chunks to bigger ones, can help.
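Meanwhile, a small benchmark like this is enough to compare chunk sizes on your own machine (all names, data sizes, and chunk sizes here are arbitrary picks, not recommendations):

```python
# Minimal chunk-size benchmark sketch: time the same summation with
# different chunk sizes and compare. Sizes are arbitrary examples.
import time

def process_in_chunks(data, chunk_size):
    total = 0.0
    for start in range(0, len(data), chunk_size):
        total += sum(data[start:start + chunk_size])
    return total

data = [float(i) for i in range(1_000_000)]
for chunk_size in (100, 200, 1_000, 10_000, 100_000):
    t0 = time.perf_counter()
    process_in_chunks(data, chunk_size)
    print(f"chunk={chunk_size:>7}: {time.perf_counter() - t0:.4f}s")
```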
Now about the "processing order": there is a difference between processing numbers and characters inside the CPU... so it would also be interesting to compare - in your case, if doable - the speed of processing one type at a time, e.g. running first:
- processing of all the dates, and "write" that (to memory or SSD);
- next... just doing the subtraction of the 2 mys... and write that again...
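A two-pass version of that idea could look like this (the date format and the rows are placeholders, since I don't know your actual data):

```python
# Hypothetical two-pass layout: pass 1 does only string-to-date parsing,
# pass 2 does only arithmetic, so the CPU sees one kind of work at a time.
from datetime import date

rows = [("2024-01-01", "2024-03-15"),
        ("2024-02-10", "2024-02-20")]  # placeholder data

# Pass 1: parse all dates and "write" the result (here: to a list in memory).
parsed = [(date.fromisoformat(a), date.fromisoformat(b)) for a, b in rows]

# Pass 2: pure number work - subtract the two dates of each row.
deltas = [(b - a).days for a, b in parsed]
print(deltas)  # [74, 10]
```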
What happens is that CPUs have many smart mechanisms and tend to detect what type of operation is being submitted to them, then optimize that process for execution with the L1 cache... therefore, if the type AND the length of the data change often, the CPU working with the cache will have to adapt just as often.
There is so much interesting stuff to learn about this! (not to mention parallel processing, right?)
- Have you had a chance to check how well your CPU cores are being used while you are processing your data?
Some systems will put some cores to rest to cool them down - nothing wrong with that.
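On Linux you can peek at per-core activity straight from /proc/stat. This little parser is my own sketch; the meaning of the fields comes from the proc(5) man page:

```python
# Linux-only sketch: parse per-core busy/total counters from /proc/stat.
def read_core_times(stat_text):
    """Return {core_name: (busy_jiffies, total_jiffies)} per "cpuN" line."""
    cores = {}
    for line in stat_text.splitlines():
        # Keep only per-core lines like "cpu0 ...", skip the aggregate "cpu" line.
        if line.startswith("cpu") and line[3:4].isdigit():
            name, *fields = line.split()
            nums = [int(f) for f in fields]
            idle = nums[3] + nums[4]   # idle + iowait columns
            total = sum(nums)
            cores[name] = (total - idle, total)
    return cores

# On a real machine you'd do:
# with open("/proc/stat") as f:
#     print(read_core_times(f.read()))
```

Sampling it twice a second apart and diffing the counters gives you each core's utilization over that interval.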
*see video at 9:05
https://www.youtube.com/watch?v=IAkj32VPcUE