We have a console command at OpenLitterMap that takes all the litter locations in the world and groups them into clusters for viewing on the global map. It’s a heavy command in terms of memory consumption, database strain, and execution time, which is why it’s only run from time to time. Let’s see how we can make it run daily, without sacrificing time and memory.
It’s going to be a bit long, so bear with me here. For brevity, we’re only showing the handle method. Being a lengthy method, it's hard to find a place to start refactoring, so let's go easy and extract some smaller methods out of it. This will help us understand the problem better, and improve the command's readability.
Refactor into smaller methods
This command does three main things. First, it generates a JSON file with all the latitude & longitude values of litter, hence the generateFeatures method. Second, using this JSON file, a node command (yes, a node command) is executed using the PHP exec function, which writes the clusters into a clusters.json file. Lastly, we populate our clusters database table using the generated JSON. I know, lengthy explanation.
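To make the shape of the refactor concrete, here is a minimal, framework-free outline of how the handle method could delegate to three smaller methods. Only generateFeatures is named in this article; the class name, the other two method names, and all the method bodies are placeholders that just record the order in which the steps run.

```php
<?php
// Hypothetical outline: only generateFeatures is named in the article.
// The class name, the other method names, and the bodies are placeholders.
final class GenerateClustersOutline
{
    /** @var string[] Records the order in which the steps ran. */
    public array $log = [];

    public function handle(): void
    {
        $this->generateFeatures(); // 1. dump lat/lon pairs to a JSON file
        $this->generateClusters(); // 2. run the node script via exec()
        $this->storeClusters();    // 3. load clusters.json into the clusters table
    }

    private function generateFeatures(): void
    {
        $this->log[] = 'features';
    }

    private function generateClusters(): void
    {
        $this->log[] = 'clusters';
    }

    private function storeClusters(): void
    {
        $this->log[] = 'store';
    }
}

$command = new GenerateClustersOutline();
$command->handle();

echo implode(' -> ', $command->log), PHP_EOL;
```

With handle reduced to three calls, each step can be read, profiled, and optimized on its own, which is what the next sections do.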
Use cursor() instead of get()
The memory issue with the line Photo::select('lat', 'lon')->get(); is that it loads all the photo objects into memory at once, which is a big problem with the 250k+ photos currently in the database. To fix that, Laravel provides a nice query helper called cursor. By using $photos = Photo::select('lat', 'lon')->cursor();, only one Eloquent model is kept in memory at any given time while iterating over the cursor. For that reason, we don't even have to call unset($photos); at all.
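The cursor method returns a LazyCollection, which is built on PHP generators. The following framework-free sketch shows the same idea with a plain generator: rows are produced one at a time and never collected into a big array. The fetchRows function is a hypothetical stand-in for the photos query, and the coordinates are made-up sample values.

```php
<?php
// Framework-free analogy: fetchRows() stands in for the photos query
// and is hypothetical; the coordinates are made-up sample values.
function fetchRows(int $count): Generator
{
    for ($i = 0; $i < $count; $i++) {
        // Each row exists only while the consumer is handling it.
        yield ['lat' => 40.0 + $i * 0.001, 'lon' => 20.0 + $i * 0.001];
    }
}

$features = 0;
foreach (fetchRows(250000) as $row) {
    // In the command, this is where a row would be written to the JSON file.
    $features++;
}

echo $features, PHP_EOL; // prints 250000
```

Because nothing accumulates between iterations, there is no large variable left over to unset once the loop is done.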
From the docs: although the cursor method uses far less memory than a regular query, it will still eventually run out of memory, because PHP's PDO driver internally caches all raw query results in its buffer. If you’re dealing with a very large number of Eloquent records, consider using the lazy method instead.
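The docs describe lazy as executing the query in chunks while still handing you the records one by one. A rough, framework-free sketch of that idea follows: a generator pulls one page at a time and yields its rows individually. Both fetchPage and lazyRows are hypothetical names, and the 2,500-row "table" is fake data.

```php
<?php
// Hypothetical stand-in for a paginated query: returns at most $size rows
// of page $page from a fake 2,500-row table.
function fetchPage(int $page, int $size): array
{
    $total = 2500;
    $start = ($page - 1) * $size;
    $rows = [];
    for ($i = $start; $i < min($start + $size, $total); $i++) {
        $rows[] = ['id' => $i + 1];
    }
    return $rows;
}

// Yields rows one by one while holding only a single page in memory.
function lazyRows(int $size): Generator
{
    $page = 1;
    do {
        $rows = fetchPage($page++, $size);
        yield from $rows;
    } while (count($rows) === $size); // a short page means we reached the end
}

$seen = 0;
foreach (lazyRows(1000) as $row) {
    $seen++;
}

echo $seen, PHP_EOL; // prints 2500
```

The trade-off is several smaller queries instead of one big one, in exchange for a memory footprint bounded by the page size.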
Use batches instead of individual inserts
Looking at the loop that iterates over all clusters, it’s killing the database with thousands of MySQL insert queries for storing the clusters. It would be much better if we utilized the Cluster::insert() method instead of Cluster::create(). This way, we'll only execute one query to insert all the records.
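To see the difference in query count, here is a small, framework-free sketch. FakeDb is a hypothetical stand-in for the database that only counts the statements it receives, and the column names are made up, since the real cluster columns are not shown here.

```php
<?php
// Hypothetical stand-in for the database: it only counts statements.
final class FakeDb
{
    public int $queries = 0;

    /** Inserts any number of rows with a single statement. */
    public function insert(string $table, array $rows): void
    {
        $this->queries++;
    }
}

// Made-up sample rows; the column names are assumptions.
$clusters = [];
for ($i = 0; $i < 5000; $i++) {
    $clusters[] = ['lat' => 40.0, 'lon' => 20.0, 'point_count' => $i];
}

// One statement per row, as a create() call inside a loop would do.
$perRow = new FakeDb();
foreach ($clusters as $cluster) {
    $perRow->insert('clusters', [$cluster]);
}

// One statement for all rows, as a single insert() call would do.
$bulk = new FakeDb();
$bulk->insert('clusters', $clusters);

echo $perRow->queries, ' vs ', $bulk->queries, PHP_EOL; // prints "5000 vs 1"
```

One thing to keep in mind: insert() goes through the query builder, so Eloquent does not fire model events or fill in created_at and updated_at for you. If the table relies on those timestamps, add them to each row yourself.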
Now this will of course fail, because sending that many megabytes in a single query is a big no-no for the database (MySQL, for example, rejects packets larger than its max_allowed_packet setting). Let’s see how we can insert the data in chunks.
We’re doing a couple of things here. First, we’re wrapping the clusters object into a Laravel Collection instance, so that the code becomes more readable and we get some nice helpers. Using the filter() method on the collection, we're moving the isset() check out of the mapping (or looping) step. The call to map() on the collection allows us to turn the $cluster objects into simple arrays, preparing them for insertion.
Now here’s the fun part. The call to ->chunk(1000) separates the collection items into multiple chunks of up to 1000 items each. The call to each() after that iterates over all the chunks and inserts them separately. This way we execute far fewer queries than before, and none of them is too large for the database to handle.
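The same filter, map, chunk, each pipeline can be sketched in plain PHP with array_filter, array_map, and array_chunk. This is only an illustration under assumptions: the property names (lat, lon, point_count) and the sample data are made up, since the structure of clusters.json is not shown here, and the insert itself is replaced by recording the batch size.

```php
<?php
// Made-up sample data: the property names are assumptions.
$clusters = [];
for ($i = 0; $i < 2500; $i++) {
    $cluster = new stdClass();
    $cluster->lat = 40.0;
    $cluster->lon = 20.0;
    if ($i % 5 !== 0) { // every fifth item lacks a point count
        $cluster->point_count = $i;
    }
    $clusters[] = $cluster;
}

// filter(): keep only the items that have the property we need.
$valid = array_filter($clusters, fn ($c) => isset($c->point_count));

// map(): turn each object into a plain array ready for insertion.
$rows = array_map(
    fn ($c) => ['lat' => $c->lat, 'lon' => $c->lon, 'point_count' => $c->point_count],
    $valid
);

// chunk(1000) + each(): one insert per batch of up to 1000 rows.
$batchSizes = [];
foreach (array_chunk($rows, 1000) as $batch) {
    // In the command, this is where Cluster::insert() would receive the batch.
    $batchSizes[] = count($batch);
}

echo implode(', ', $batchSizes), PHP_EOL; // prints "1000, 1000"
```

Here 500 of the 2,500 items are filtered out, and the remaining 2,000 rows go to the database in two queries instead of 2,000. The batch size of 1000 is a knob worth tuning: wider rows may call for smaller batches.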
Bonus small improvements
There are three small improvements I’d like to mention. Just to feel… complete.
The code used for illustration is taken from the OpenLitterMap project. They’re doing a great job creating the world’s most advanced open database on litter, brands & plastic pollution. The project is open-sourced and would love your contributions, both as users and developers.
Originally published at https://genijaho.dev.