Make iManage REST API requests 30x faster in Python
Here's a brief article on how I enhanced my Python script to update 63 documents per second instead of just 2.
WARNING
- The script uses multiprocessing functions that I don't fully understand, so use it at your own risk.
- The script is straightforward; I cannot guarantee that this method will work for more complex tasks.
- Rapidly updating a large number of items may overload the indexer and refiler.
- This script has only been tested with cloudimanage.
With the warning addressed, let's dive into the story.
STORY
I was tasked with updating the classification of nearly every document in a cloudimanage environment. Searches using the "class" filter were timing out, so I opted to go through all documents and emails and update only those that needed it (about 95% of all documents).
My original script was as follows:
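# Authenticate once and load the class mapping
access_token = authenticate()
mapping_table = read_csv_into_dict('mappingClasses.csv')
documents = get_documents(access_token,...)
# Update documents one at a time; each call waits for the API response
for document in documents:
    update_response = update_doc(doc_id,...)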
However, this method was too sluggish, managing only about 2 documents per second.
With a go-live date less than a week away, we couldn't tolerate such a slow pace: at 2 documents per second, updating 8 million documents would take about 46 days.
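The 46-day estimate is simple arithmetic, and the same calculation shows what the faster rate buys. A minimal sketch, assuming a constant throughput for the whole run (the helper name days_needed is made up for this illustration):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_needed(total_docs, docs_per_second):
    """Return how many days it takes to process total_docs at a constant rate."""
    return total_docs / docs_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    total = 8_000_000
    print(f"{days_needed(total, 2):.1f} days at 2 docs/s")    # 46.3 days
    print(f"{days_needed(total, 63):.1f} days at 63 docs/s")  # 1.5 days
```

At 63 documents per second the same 8 million documents fit into roughly a day and a half.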
SOLUTION
I explored multithreading, multiprocessing, and asyncio. It seems asyncio is preferred when REST API response time is the limiting factor. Unfortunately, I just couldn't figure it out.
So, I turned to the next best option – multiprocessing. I selected it for three reasons:
- It was easier to understand,
- It could utilize multiple CPU cores,
- It avoided the limitations of Python's Global Interpreter Lock.
I ended up with the following script (pseudocode):
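from multiprocessing import Pool
access_token = authenticate()
mapping_table = read_csv_into_dict('mappingClasses.csv')
documents = get_documents(access_token,...)
# Start 16 worker processes
p = Pool(16)
for document in documents:
    # Queue the work item and continue without waiting for the result
    result = p.apply_async(process_doc, args=(doc_id, ...))
# No more tasks will be submitted; wait for the workers to finish
p.close()
p.join()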
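For readers who want something they can run, here is a minimal sketch of the same pattern. It is not the original script: process_doc here only sleeps to imitate a network round trip, and the document ids, the class name and the update_all helper are hypothetical placeholders.

```python
import time
from multiprocessing import Pool


def process_doc(doc_id, new_class):
    """Hypothetical worker: stands in for the real PATCH request."""
    time.sleep(0.01)  # simulated network round trip, not a real API call
    return doc_id, new_class


def update_all(documents, workers=16):
    """Submit every document to the pool and collect the results."""
    with Pool(workers) as pool:
        pending = [pool.apply_async(process_doc, args=doc) for doc in documents]
        pool.close()  # no more tasks will be submitted
        pool.join()   # wait for all workers to finish
        # .get() re-raises any exception that happened inside a worker
        return [job.get() for job in pending]


if __name__ == "__main__":
    docs = [(f"doc-{n}", "CONTRACT") for n in range(100)]
    print(len(update_all(docs)), "documents processed")
```

Two details are worth noting. The `if __name__ == "__main__":` guard is needed wherever Python starts worker processes by re-importing the script (the default on Windows and macOS). And if the result of apply_async is never read with .get(), an exception raised inside a worker goes unnoticed.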
A few things that made it work:
- Each PATCH request operated independently,
- API responses were only logged; nothing else in the script depended on them,
- The whole logic of class identification, document update and logging was combined into one function – process_doc()
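A single worker function along these lines might look like the sketch below. It is an illustration only: the signature, the send_patch placeholder and the return values are assumptions, and the real PATCH call to the iManage API is deliberately left out.

```python
import logging

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(processName)s %(message)s")
log = logging.getLogger("reclassify")


def send_patch(doc_id, new_class):
    """Hypothetical placeholder for the real PATCH call; returns a status code."""
    return 200


def process_doc(doc_id, old_class, mapping_table, patch=send_patch):
    """Look up the new class, update the document, and log the outcome."""
    new_class = mapping_table.get(old_class)
    if new_class is None or new_class == old_class:
        log.info("%s skipped (class %s)", doc_id, old_class)
        return "skipped"
    try:
        status = patch(doc_id, new_class)
    except Exception:
        # Log inside the worker so a failed update is not silently lost
        log.exception("%s failed", doc_id)
        return "failed"
    log.info("%s %s -> %s (HTTP %s)", doc_id, old_class, new_class, status)
    return "updated"
```

Because the function takes everything it needs as arguments and reports only through the log, the parent process never has to wait for or inspect individual results.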
The script ran on a laptop with 16GB of RAM and a 4 Core/8 Thread CPU at 2.8 GHz. All 8 logical processors were engaged, with average CPU usage hitting 80%.
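The pool of 16 workers is twice the 8 logical processors of that laptop. How 16 was chosen isn't stated here, so the multiplier of 2 in the sketch below is an assumption, as is the helper name pool_size. Running more workers than cores can pay off when each worker spends most of its time waiting for the API to respond.

```python
import os


def pool_size(logical_cpus, multiplier=2):
    """Return a worker count as a multiple of the logical CPU count."""
    # os.cpu_count() can return None, so fall back to a single CPU
    return max(1, (logical_cpus or 1) * multiplier)


if __name__ == "__main__":
    print(pool_size(os.cpu_count()))
```

On a machine with 8 logical processors this gives 16, which could then be passed to Pool().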
CONCLUSION
I omitted many details on purpose. I didn’t attach any source code either.
I hope you now know how to use Pool() to utilize all your CPU cores to process iManage REST API requests faster.