Looping Efficiency In Python
Solution 1:
You're not utilizing the full power of modern computers, that have multiple central processing units! This is by far the best optimization you can have here, since this is CPU bound. Note: for I/O bound operations multithreading (using the threading module) is suitable.
So let's see how python makes it easy to do so using multiprocessing module (read comments):
import hashlib
# you're sampling a string so you need sample, not 'choice'from random import sample
import multiprocessing
# use a thread to synchronize writing to fileimport threading
# open up to 4 processes per cpu
processes_per_cpu = 4
processes = processes_per_cpu * multiprocessing.cpu_count()
print"will use %d processes" % processes
longitud = 8
valores = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"# check on smaller ranges to compare before trying your range... :-)
RANGE = 200000defenc(string):
m = hashlib.md5()
m.update(string.encode('utf-8'))
return m.hexdigest()
# we synchronize the results to be written using a queue shared by processes
q = multiprocessing.Manager().Queue()
# this is the single point where results are written to the file# the file is opened ONCE (you open it on every iteration, that's bad)defwrite_results():
withopen('datos.txt', 'w') as f:
whileTrue:
msg = q.get()
if msg == 'close':
break;
else:
f.write(msg)
# this is the function each process uses to calculate a single resultdefcalc_one(i):
s = ''.join(sample(valores, longitud))
md = enc(s)
q.put("%s %s\n" % (s, md))
# we start a process pool of workers to spread work and not rely on# a single cpu
pool = multiprocessing.Pool(processes=processes)
# this is the thread that will write the results coming from# other processes using the queue, so it's execution target is write_results
t = threading.Thread(target=write_results)
t.start()
# we use 'map_async' to not block ourselves, this is redundant here,# but it's best practice to use this when you don't HAVE to block ('pool.map')
pool.map_async(calc_one, xrange(RANGE))
# wait for completion
pool.close()
pool.join()
# tell result-writing thread to stop
q.put('close')
t.join()
There are probably more optimizations to be done in this code, but a major optimization for any cpu-bound task like you present, is using multiprocessing.
note: A trivial optimization of file writes would be to aggregate some results from the queue and write them together (if you have many cpus that exceed the single writing thread's speed)
note 2: Since OP was looking to go over combinations/permutations of stuff, it should be noted that there is a module for doing just that, and it's called itertools.
Solution 2:
Note that you should use
forcodin itertools.product(valores, longitud):
rather than picking the strings via random.sample
as this will only ever visit a given string once.
Also note that for your given values this loop has 218340105584896 iterations. And the output file will occupy 9170284434565632 bytes or 8PB.
Solution 3:
Profile your program first (with the cProfile
module: https://docs.python.org/2/library/profile.html and http://ymichael.com/2014/03/08/profiling-python-with-cprofile.html), but I'm willing to bet your program is IO-bound (if your CPU usage never reaches 100% on one core, it means your hard drive is too slow to keep up with the execution speed of the rest of the program).
With that in mind, start by changing your program so that:
- It opens and closes the file outside of the loop (opening and closing files is super slow).
- It only makes one
write
call in each iteration (those each translate to a syscall, which are expensive), like so:f.write("%s %s\n" % (cod, md))
Solution 4:
Although it helps with debugging, I have found printing makes a program run slower, so maybe don't quite print as much out. Also I'd move the "f=open('datos.txt', 'a') out from the loop as I can imagine Opening the same file over and over again might cause some time issues, and then move the "f.close()" out of the loop also to the end of the program.
Post a Comment for "Looping Efficiency In Python"