Make iManage REST API requests 30x faster in Python

Written by Wiktor Abramowicz | May 28, 2024 9:31:12 AM

Here's a brief article on how I enhanced my Python script to update 63 documents per second instead of just 2.

 

WARNING

  • The script includes multiprocessing functions that I don't fully comprehend, so it should be used at your own risk.
  • The script is straightforward; I cannot assure that this method will work with more complex tasks.
  • Rapidly updating a large number of items may overload the indexer and refiler.
  • This script has only been tested with cloudimanage.

With the warning addressed, let's dive into the story.

 

STORY

I was tasked with updating the classification of nearly every document in a cloudimanage environment. Searches using the "class" filter were timing out, so I opted to iterate over all documents and emails and update only the ones that needed it (about 95% of all documents).

My original script was as follows:

 

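access_token = authenticate()

mapping_table = read_csv_into_dict('mappingClasses.csv')

documents = get_documents(access_token,...)

# One blocking request at a time: each update waits for the previous response
for document in documents:
    update_response = update_doc(doc_id,...)  # doc_id comes from the current document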

 

 

However, this method was too sluggish, managing only about 2 documents per second.

Facing a go-live date less than a week away, we couldn't tolerate such a slow pace. It would take 46 days to update 8 million documents.
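The estimate is simple throughput arithmetic. A small sketch, using only the figures quoted in this article (8 million documents, 2 versus 63 documents per second) and assuming the rate stays constant for the whole run:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_update(total_docs: int, docs_per_second: float) -> float:
    """Return the wall-clock days needed at a constant throughput."""
    return total_docs / docs_per_second / SECONDS_PER_DAY


total_docs = 8_000_000  # figure quoted in the article

print(f"{days_to_update(total_docs, 2):.1f} days at 2 docs/s")    # 46.3 days
print(f"{days_to_update(total_docs, 63):.1f} days at 63 docs/s")  # 1.5 days
```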

 

SOLUTION

I explored multithreading, multiprocessing, and asyncio. It seems asyncio is preferred when REST API response time is the limiting factor. Unfortunately, I just couldn't figure it out.
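For readers curious about that route, here is a minimal sketch of the general asyncio pattern. It is not what the rest of this article uses. The names fake_patch, update_all and max_in_flight are made up for illustration, and asyncio.sleep stands in for the HTTP round trip, so nothing is sent to iManage. A real version would need an async HTTP client in place of the sleep.

```python
import asyncio


async def fake_patch(doc_id: str, latency: float = 0.05) -> dict:
    """Hypothetical stand-in for one PATCH request; it only waits."""
    await asyncio.sleep(latency)
    return {"doc_id": doc_id, "status": 200}


async def update_all(doc_ids, max_in_flight: int = 16):
    """Run the stand-in for every id, with at most max_in_flight pending."""
    limiter = asyncio.Semaphore(max_in_flight)

    async def limited(doc_id):
        async with limiter:
            return await fake_patch(doc_id)

    # gather returns the results in the same order as the input ids
    return await asyncio.gather(*(limited(d) for d in doc_ids))


if __name__ == "__main__":
    responses = asyncio.run(update_all([f"doc-{i}" for i in range(64)]))
    print(len(responses), "simulated updates finished")
```

While one simulated request is waiting, the event loop starts the others, which is why this style suits work dominated by response time.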

So, I turned to the next best option – multiprocessing. I selected it for three reasons:

  • It was easier to understand,
  • It could utilize multiple CPU cores,
  • It avoided the limitations of Python's Global Interpreter Lock.

I ended up with the following script (pseudocode):

 

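from multiprocessing import Pool

# The guard is needed wherever worker processes are spawned (e.g. Windows)
if __name__ == '__main__':
    access_token = authenticate()

    mapping_table = read_csv_into_dict('mappingClasses.csv')

    documents = get_documents(access_token,...)

    p = Pool(16)  # 16 worker processes

    for document in documents:
        # Queue the task and move on; process_doc runs in a worker process
        result = p.apply_async(process_doc, args=(doc_id, ...))

    p.close()  # no more tasks will be submitted
    p.join()   # wait for all queued tasks to finish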
 
 
 

A few things that made it work:

  • Each PATCH request operated independently,
  • API responses were only logged; nothing later in the script depended on them,
  • The whole logic of class identification, document update and logging was combined into one function – process_doc()
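To make those three points concrete, here is a minimal, runnable sketch of how such a process_doc() could be shaped. Everything except Pool and apply_async is an assumption for illustration: send_patch is a hypothetical stand-in that never contacts iManage, the class names in the mapping are invented, and the real mappingClasses.csv layout is not shown in this article. The error_callback is an addition of this sketch, because a task queued with apply_async otherwise fails silently unless its result is read.

```python
import logging
from multiprocessing import Pool

# Module-level config so spawned workers pick it up on import as well
logging.basicConfig(level=logging.INFO, format="%(processName)s %(message)s")


def send_patch(doc_id: str, new_class: str) -> dict:
    """Hypothetical stand-in for the PATCH request; it does not call iManage."""
    return {"doc_id": doc_id, "class": new_class, "status": 200}


def process_doc(doc_id: str, old_class: str, mapping_table: dict) -> str:
    """Identify the new class, update the document and log the response."""
    new_class = mapping_table.get(old_class)
    if new_class is None:
        logging.info("%s skipped: no mapping for %s", doc_id, old_class)
        return "skipped"
    response = send_patch(doc_id, new_class)
    logging.info("%s -> %s (status %s)", doc_id, new_class, response["status"])
    return "updated"


def log_failure(exc: BaseException) -> None:
    """Runs in the parent process when a worker task raised an exception."""
    logging.error("task failed: %r", exc)


def run(documents, mapping_table, workers: int = 4) -> None:
    """Queue one independent task per document and wait for all of them."""
    with Pool(workers) as pool:
        for doc_id, old_class in documents:
            pool.apply_async(
                process_doc,
                args=(doc_id, old_class, mapping_table),
                error_callback=log_failure,
            )
        pool.close()  # no more tasks will be submitted
        pool.join()   # wait until every queued task has finished


if __name__ == "__main__":
    mapping = {"DOC": "DOCUMENT", "EML": "EMAIL"}  # invented class names
    docs = [("1001", "DOC"), ("1002", "EML"), ("1003", "UNKNOWN")]
    run(docs, mapping)
```

Because each task carries everything it needs in its arguments and returns nothing the parent waits on, the workers never have to coordinate with each other.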

The script ran on a laptop with 16GB of RAM and a 4 Core/8 Thread CPU at 2.8 GHz. All 8 logical processors were engaged, with average CPU usage hitting 80%.

 

CONCLUSION

I omitted many details on purpose. I didn’t attach any source code either.

I hope you now know how to use Pool() to utilize all your CPU cores to process iManage REST API requests faster.