Skip to content

Conversation

@Mantisus
Copy link
Collaborator

@Mantisus Mantisus commented Nov 11, 2025

Description

  • This PR adds new buffer tables to improve handling of metadata records. The key change is that updates to metadata are now accumulated in buffer and applied when get_metadata is called. With the old behavior, metadata records were updated instantly within a transaction. This led to waiting for locks to be released in high-concurrency situations.

Issues

@Mantisus Mantisus self-assigned this Nov 11, 2025
Copy link
Collaborator

@vdusek vdusek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please PR type without an exclamation mark

@Mantisus Mantisus changed the title perf!: Optimize metadata records processing in 'SqlStorageClient` perf: Optimize metadata records processing in 'SqlStorageClient` Nov 11, 2025
@Mantisus Mantisus marked this pull request as ready for review November 17, 2025 14:03
@Mantisus Mantisus requested review from janbuchar and vdusek November 17, 2025 14:03
@janbuchar
Copy link
Collaborator

Interesting! I'd imagine that transactions consisting of e.g., an insertion to the dataset_items table and an update to dataset metadata wouldn't lock the metadata table for that long - you can commit right after the update to metadata.

Also, the buffering approach is faster because the buffer table gets a row for each increment and those get compacted later on, correct?

@Mantisus
Copy link
Collaborator Author

update to dataset metadata wouldn't lock the metadata table for that long

They will create many short-lived locks. And with a large number of clients with high concurrency inserting new records, this effect will accumulate.
This is exactly what @ericvg97 pointed out - #1533 (comment)

Although, of course, the strongest impact on RequestQueue

Yes, insert operations into the buffer table are quite fast. And then we can simply apply the result of the aggregations to update the metadata record.

@Mantisus Mantisus changed the title perf: Optimize metadata records processing in 'SqlStorageClient` perf: Optimize metadata records processing in SqlStorageClient Nov 18, 2025
@janbuchar
Copy link
Collaborator

update to dataset metadata wouldn't lock the metadata table for that long

They will create many short-lived locks. And with a large number of clients with high concurrency inserting new records, this effect will accumulate. This is exactly what @ericvg97 pointed out - #1533 (comment)

Although, of course, the strongest impact on RequestQueue

I see, thanks. And is there any chance that the lock is held for too long because of how we work with sqlalchemy? In other words, would it be better if we just executed sql such as insert ...; update ...; commit in one go? If yes, it might be worth trying before adding three new tables to the whole thing.

@Mantisus
Copy link
Collaborator Author

it might be worth trying before adding three new tables to the whole thing.

I will test this approach.

@Mantisus Mantisus marked this pull request as draft November 26, 2025 16:36
@Mantisus
Copy link
Collaborator Author

I will test this approach.

Unfortunately, switching to queries such as insert ...; update ...; commit does not improve the situation. Concurrent access to metadata records remains a bottleneck.

So far, switching to additional tables with on-demand metadata processing has shown the best results.
This also opens up the possibility of integrating MySQL, provided that the DBMS is running with transaction_isolation=READ-COMMITTED.

@Mantisus Mantisus marked this pull request as ready for review January 26, 2026 10:21
@vdusek vdusek requested a review from Copilot January 26, 2026 13:06
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes metadata record processing in SqlStorageClient by introducing buffer tables to defer metadata updates and reduce database lock contention in high-concurrency scenarios. Instead of updating metadata instantly within transactions, updates are now accumulated in buffer tables and applied when get_metadata is called.

Changes:

  • Added buffer tables (dataset_metadata_buffer, key_value_store_metadata_buffer, request_queue_metadata_buffer) to accumulate metadata updates
  • Implemented buffer processing mechanism with locking to prevent concurrent buffer processing
  • Increased database connection pool settings to handle higher concurrency

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 13 comments.

Show a summary per file
File Description
src/crawlee/storage_clients/_sql/_storage_client.py Removed _accessed_modified_update_interval mechanism; increased connection pool sizes
src/crawlee/storage_clients/_sql/_db_models.py Added buffer table models and buffer_locked_until field to metadata tables; added indexes
src/crawlee/storage_clients/_sql/_client_mixin.py Implemented core buffer processing logic including _add_buffer_record, _process_buffers, and lock management
src/crawlee/storage_clients/_sql/_request_queue_client.py Integrated buffer system; replaced direct metadata updates with buffer records
src/crawlee/storage_clients/_sql/_key_value_store_client.py Integrated buffer system for KVS operations
src/crawlee/storage_clients/_sql/_dataset_client.py Integrated buffer system for dataset operations
tests/unit/storage_clients/_sql/test_sql_rq_client.py Removed test fixtures that modified _accessed_modified_update_interval
tests/unit/storage_clients/_sql/test_sql_kvs_client.py Removed test fixtures that modified _accessed_modified_update_interval
tests/unit/storage_clients/_sql/test_sql_dataset_client.py Removed test fixtures that modified _accessed_modified_update_interval
docs/guides/storage_clients.mdx Updated ER diagrams to include new buffer tables

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Mantisus and others added 3 commits January 26, 2026 15:16
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 11 comments.

Comments suppressed due to low confidence (1)

src/crawlee/storage_clients/_sql/_dataset_client.py:290

  • The docstring mentions a session parameter that doesn't exist in the method signature. This appears to be leftover documentation from before the refactoring. The parameter should be removed from the docstring to match the actual method signature.
            session: The SQLAlchemy AsyncSession to use for the update.
            new_item_count: If provided, set item count to this value.
            delta_item_count: If provided, add this value to the current item count.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Mantisus and others added 5 commits January 26, 2026 16:37
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Updating request queue metada performs full table scan in SQL storage

3 participants