
Use Python to Download Multiple Files (or URLs) in Parallel


Get more data in less time

Photo by Wesley Tingey on Unsplash

We live in a world of massive data. Often, big data is organized as a large collection of small datasets (i.e., one large dataset comprised of multiple files). Obtaining these data is commonly frustrating because of the download (or acquisition) burden. Fortunately, with a little bit of code, there are ways to automate and speed up file download and acquisition.

Automating file downloads can save loads of time. There are several ways to automate file downloads with Python. The simplest option is to use a straightforward Python loop to iterate through a list of URLs to download. This serial approach can work well for a few small files, but if you are downloading many files or large files, you'll want to use a parallel approach to maximize your computational resources.

With a parallel file download routine, you can make better use of your computer's resources to download multiple files simultaneously, saving you time. This tutorial demonstrates how to develop a generic file download function in Python and apply it to download multiple files with serial and parallel approaches. The code in this tutorial uses only the requests module plus modules from the Python standard library, so the only installation you may need is requests.

For this example, we only need the requests and multiprocessing Python modules to download files in parallel. The multiprocessing module is part of the Python standard library; requests is a third-party package that you can install with pip if you don't already have it.

We’ll also import the time module to keep track of how long it takes to download individual files and to compare performance between the serial and parallel download routines. The time module is also part of the Python standard library.

import requests
import time
from multiprocessing import cpu_count
from multiprocessing.pool import ThreadPool

I’ll demonstrate parallel file downloads in Python using gridMET NetCDF files that contain daily precipitation data for the United States.

Here, I specify the URLs to four files in a list. In other applications, you may programmatically generate the list of files to download; a short sketch of that follows the hardcoded list below.

urls = ['https://www.northwestknowledge.net/metdata/data/pr_1979.nc',
        'https://www.northwestknowledge.net/metdata/data/pr_1980.nc',
        'https://www.northwestknowledge.net/metdata/data/pr_1981.nc',
        'https://www.northwestknowledge.net/metdata/data/pr_1982.nc']
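
Since the gridMET precipitation files only differ by year, one way to build such a list programmatically is with a simple loop over the years of interest. This is just a sketch, and the hardcoded list above works exactly the same way.

# Build the URL list from a year range; the filename pattern (pr_<year>.nc)
# is taken from the URLs above.
years = range(1979, 1983)
urls = [f'https://www.northwestknowledge.net/metdata/data/pr_{year}.nc'
        for year in years]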

Each URL must be associated with its download location. Here, I’m downloading the files to the Windows ‘Downloads’ directory. I’ve hardcoded the filenames in a list for simplicity and transparency. Depending on your application, you may want to write code that parses the input URL and downloads the file to a specific directory; a sketch of that follows the filename list below.

fns = [r'C:\Users\konrad\Downloads\pr_1979.nc',
       r'C:\Users\konrad\Downloads\pr_1980.nc',
       r'C:\Users\konrad\Downloads\pr_1981.nc',
       r'C:\Users\konrad\Downloads\pr_1982.nc']
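
If you’d rather derive the local paths from the URLs than hardcode them, here is a minimal sketch; the download directory (download_dir) is an assumption, so point it wherever you want the files to land.

from pathlib import Path
from urllib.parse import urlparse

# Hypothetical download directory; adjust for your system.
download_dir = Path.home() / 'Downloads'

# Use the last path segment of each URL as the local filename.
fns = [download_dir / Path(urlparse(url).path).name for url in urls]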

Multiprocessing requires parallel functions to have just one argument (there are some workarounds, but we won’t get into that here). To download a file we need to pass two arguments, a URL and a filename. So we’ll zip the urls and fns lists together to get a list of tuples, wrapping the result in list so it can be iterated over more than once (first by the serial loop, later by the parallel routine). Each tuple in the list contains two elements: a URL and the download filename for that URL. This way we can pass a single argument (the tuple) that contains two pieces of information.

inputs = list(zip(urls, fns))

Now that we have specified the URLs to download and their associated filenames, we need a function to download the URLs (download_url).

We’ll pass one argument (args) to download_url. This argument will be an iterable (list or tuple) where the first element is the URL to download (url) and the second element is the filename (fn). The elements are assigned to variables (url and fn) for readability.

Now create a try statement in which the URL is retrieved and written to the file after the file is created. When the file is written, the URL and download time are returned. If an exception occurs, a message is printed.

The download_url function is the meat of our code. It does the actual work of downloading and file creation. We can now use this function to download files in serial (using a loop) and in parallel. Let’s go through those examples.

def download_url(args):
    # args is a (url, filename) tuple
    t0 = time.time()
    url, fn = args[0], args[1]
    try:
        r = requests.get(url)
        with open(fn, 'wb') as f:
            f.write(r.content)
        # Return the URL and the time it took to download
        return (url, time.time() - t0)
    except Exception as e:
        print('Exception in download_url():', e)
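
Note that download_url reads each response fully into memory (r.content) before writing it to disk. If your individual files are very large, a streaming variant may serve you better. The sketch below keeps the same (url, fn) argument convention; the chunk size and timeout are assumptions you can tune.

def download_url_streaming(args):
    # Same (url, fn) tuple interface as download_url, but the response body
    # is written to disk in chunks instead of being held fully in memory.
    t0 = time.time()
    url, fn = args[0], args[1]
    try:
        with requests.get(url, stream=True, timeout=60) as r:
            r.raise_for_status()
            with open(fn, 'wb') as f:
                for chunk in r.iter_content(chunk_size=1024 * 1024):
                    f.write(chunk)
        return (url, time.time() - t0)
    except Exception as e:
        print('Exception in download_url_streaming():', e)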

To download the list of URLs to the associated files, loop through the iterable (inputs) that we created, passing each element to download_url. After each download is complete we will print the downloaded URL and the time it took to download.

The total time to download all URLs will print after all downloads have completed.

t0 = time.time()
for i in inputs:
    result = download_url(i)
    print('url:', result[0], 'time:', result[1])
print('Total time:', time.time() - t0)

Output:

url: https://www.northwestknowledge.net/metdata/data/pr_1979.nc time: 16.381176710128784 
url: https://www.northwestknowledge.net/metdata/data/pr_1980.nc time: 11.475878953933716
url: https://www.northwestknowledge.net/metdata/data/pr_1981.nc time: 13.059367179870605
url: https://www.northwestknowledge.net/metdata/data/pr_1982.nc time: 12.232381582260132
Total time: 53.15849542617798

It took between 11 and 16 seconds to download the individual files. The total download time was a little less than one minute. Your download times will vary based on your specific network connection.

Let’s compare this serial (loop) approach to the parallel approach below.

To start, create a function (download_parallel) to handle the parallel download. The function (download_parallel) will take one argument, an iterable containing URLs and associated filenames (the inputs variable we created earlier).

Next, get the number of CPUs available for processing. This will determine the number of threads to run in parallel.

Now use the multiprocessing ThreadPool to map the inputs to the download_url function. Here we use the imap_unordered method of ThreadPool and pass it the download_url function and the input arguments to download_url (the inputs variable). The imap_unordered method will run download_url concurrently across the specified number of threads (i.e., parallel download).

Thus, if we have four files and four threads, all files can be downloaded at the same time instead of waiting for one download to finish before the next starts. This can save a considerable amount of processing time.

In the final part of the download_parallel function, the downloaded URLs and the time required to download each URL are printed, along with the total time.

def download_parallel(args):
    t0 = time.time()
    cpus = cpu_count()
    # Map the (url, filename) tuples onto download_url across the thread pool
    results = ThreadPool(cpus - 1).imap_unordered(download_url, args)
    for result in results:
        print('url:', result[0], 'time (s):', result[1])
    print('Total time:', time.time() - t0)

Once inputs and download_parallel are defined, the files can be downloaded in parallel with a single line of code.

download_parallel(inputs)

Output:

url: https://www.northwestknowledge.net/metdata/data/pr_1980.nc time (s): 14.641696214675903 
url: https://www.northwestknowledge.net/metdata/data/pr_1981.nc time (s): 14.789752960205078
url: https://www.northwestknowledge.net/metdata/data/pr_1979.nc time (s): 15.052601337432861
url: https://www.northwestknowledge.net/metdata/data/pr_1982.nc time (s): 23.287317752838135
Total time: 23.32273244857788

Notice that it took longer to download each individual file with the parallel approach. This may be a result of changing network speed, or of the overhead required to map the downloads to their respective threads. Although the individual files took longer to download, the parallel method resulted in a greater than 50% decrease in total download time.

You can see how parallel processing can greatly reduce processing time for multiple files. As the number of files increases, you’ll save much more time by using a parallel download approach.

Automating file downloads in your development and analysis routines can save you a lot of time. As this tutorial demonstrates, implementing a parallel download routine can greatly decrease file acquisition time if you require many files or large files.
