With the warning addressed, let's dive into the story.
I was charged with updating classifications on nearly every document within a cloudimanage environment. Searches using the "class" filter were timing out, so I opted to review all documents and emails, processing only the necessary ones (about 95% of all documents).
My original script was as follows:
# Authenticate once, then load the old-class -> new-class mapping.
access_token = authenticate()
mapping_table = read_csv_into_dict('mappingClasses.csv')

# Walk every document and update each one sequentially.
documents = get_documents(access_token, ...)
for document in documents:
    update_response = update_doc(document['id'], ...)
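read_csv_into_dict is just a small helper; it would look roughly like this, assuming mappingClasses.csv is a two-column file of old and new class codes:

import csv

def read_csv_into_dict(path):
    # Assumes two columns per row: old class code, new class code.
    with open(path, newline='') as f:
        return {row[0]: row[1] for row in csv.reader(f)}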
However, this approach was far too slow, managing only about 2 documents per second. At that rate, 8 million documents would take roughly 46 days to update, and with a go-live date less than a week away, we couldn't tolerate such a pace.
I explored multithreading, multiprocessing, and asyncio. Asyncio is generally preferred when the workload is I/O-bound, as it is here, where REST API response time is the limiting factor. Unfortunately, I just couldn't get it working in time.
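In hindsight, an asyncio version would look roughly like the sketch below, using aiohttp with a semaphore to cap concurrency. The URL and payload here are placeholders, not the actual iManage REST API:

import asyncio
import aiohttp

async def update_one(session, semaphore, doc_id, new_class):
    # The semaphore caps in-flight requests so the server isn't flooded.
    async with semaphore:
        # Placeholder endpoint and payload; the real API will differ.
        url = f"https://example.cloudimanage.com/api/documents/{doc_id}"
        async with session.patch(url, json={"class": new_class}) as resp:
            return doc_id, resp.status

async def update_all(doc_ids, mapping_table, access_token):
    semaphore = asyncio.Semaphore(50)
    headers = {"Authorization": f"Bearer {access_token}"}
    async with aiohttp.ClientSession(headers=headers) as session:
        tasks = [update_one(session, semaphore, d, mapping_table[d])
                 for d in doc_ids]
        return await asyncio.gather(*tasks)

# results = asyncio.run(update_all(doc_ids, mapping_table, access_token))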
So, I turned to the next best option – multiprocessing. I selected it for three reasons:
I ended up with the following script (pseudocode):
from multiprocessing import Pool

# The guard is required so child processes don't re-run this setup
# when multiprocessing spawns them (notably on Windows).
if __name__ == '__main__':
    access_token = authenticate()
    mapping_table = read_csv_into_dict('mappingClasses.csv')
    documents = get_documents(access_token, ...)

    p = Pool(16)  # more workers than cores: the work is network-bound
    results = [p.apply_async(process_doc, args=(document['id'], ...))
               for document in documents]
    p.close()  # no more tasks will be submitted
    p.join()   # wait for every worker to finish
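The process_doc worker does the actual REST call. It isn't shown above, but in sketch form it might look like this; the endpoint and payload fields are placeholders, not the actual iManage REST API:

import requests

def process_doc(doc_id, new_class, access_token):
    # Runs in a child process: update one document's classification.
    # Placeholder URL and payload; substitute the real iManage endpoint.
    url = f"https://example.cloudimanage.com/api/documents/{doc_id}"
    headers = {"Authorization": f"Bearer {access_token}"}
    try:
        resp = requests.patch(url, json={"class": new_class},
                              headers=headers, timeout=30)
        return doc_id, resp.status_code
    except requests.RequestException as exc:
        # Report failures to the parent instead of crashing the worker.
        return doc_id, repr(exc)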
A few things that made it work:
The script ran on a laptop with 16 GB of RAM and a 4-core/8-thread CPU at 2.8 GHz. All 8 logical processors were engaged, with average CPU usage around 80%. Note that running 16 worker processes on 8 logical cores oversubscribes the CPU, which works here because each worker spends most of its time waiting on network I/O.
I omitted many details on purpose. I didn’t attach any source code either.
I hope you now know how to use Pool() to utilize all your CPU cores to process iManage REST API requests faster.