In Azure Data Factory (ADF), a Delete Activity is used to
remove files or folders from a file-based store (like Azure Blob Storage, Azure
Data Lake Storage, Amazon S3, SFTP, etc.) based on specified conditions. This
activity is useful for cleaning up data that is no longer needed as part of a
data processing workflow.
ADF – Important Considerations for Delete Activity
Definition:
The Delete Activity in Azure Data Factory deletes files and folders from
on-premises or cloud data stores. In short, it is the activity to use for
cleanup.
- Deleted files cannot be restored unless soft delete is enabled on the
underlying storage.
- Be cautious and back up your files before running this activity. Before
applying it in production, test the Delete Activity in a development
environment to confirm that it deletes only the intended files or folders and
does not affect other data.
- Make sure no other process is creating or writing to a file at the same
time the Delete Activity is removing it.
- Make sure the ADF managed identity or the
service principal you are using has the necessary
permissions to delete files in the storage account.
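One way to follow the testing advice above is to preview which names a filter would select before any delete runs. The following is a minimal local sketch, not part of ADF: the helper `preview_matches` and the sample listing are hypothetical, and Python's `fnmatchcase` only approximates the wildcard matching of the Delete Activity (for example, it also accepts `[seq]` patterns).

```python
from fnmatch import fnmatchcase


def preview_matches(names, wildcard=None, prefix=None):
    """Return the names a wildcard and/or prefix filter would select.

    Nothing is deleted; this only shows what a filter would pick up.
    """
    selected = []
    for name in names:
        # Skip names that do not match the wildcard, if one was given.
        if wildcard is not None and not fnmatchcase(name, wildcard):
            continue
        # Skip names that do not start with the prefix, if one was given.
        if prefix is not None and not name.startswith(prefix):
            continue
        selected.append(name)
    return selected


# Hypothetical folder listing used only for illustration.
listing = ["account_2023.csv", "account_2024.csv", "orders.csv", "readme.txt"]

print(preview_matches(listing, wildcard="*.csv"))
print(preview_matches(listing, prefix="account_"))
```

Comparing the preview against the files you expect to lose is a cheap check to run in a development environment before the activity is published.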
ADF – Delete activity implementation
Requirement:
- Delete files from Azure Blob Storage.
Implementation:
- Create a dataset. This is the source location from which the files or
folders will be deleted. The dataset can point to either a single file or a
folder.
- Add a Delete activity.
- Go to the Source tab. There are a few
properties you need to set.
- Dataset: The dataset that points to the files or folders to delete, either
entirely or partially.
- File path type:
- File path in dataset: Consider the
selected dataset as the location of the files to be deleted.
- Wildcard file path: Select specific files from the selected dataset using
wildcards (e.g., *.csv picks all the CSV files in the selected path).
- Prefix: Select files or folders whose names start with a specific string
(e.g., files starting with account_).
- List of files: Point to a text file that
lists each file (relative path to the path configured in the dataset) that you
want to delete.
- Filter by last modified: Only files whose last-modified time falls in the
range [Start time, End time) are selected for deletion. The start time is
inclusive and the end time is exclusive.
- Recursively: Process all files in the
input folder and its subfolders recursively or just the ones in the selected
folder. This setting is disabled when a single file is selected.
- Max concurrent connections: The upper
limit of concurrent connections established to the data store during the
activity run. Specify a value only when you want to limit concurrent
connections.
- Retention time: To apply a retention period, so that files older than the
period are deleted, set the End time of the last-modified filter to the
current time minus the retention period.
- Publish Changes: After configuring
the Delete Activity, remember to save and publish your changes to make the
activity part of your ADF pipeline.
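The Source tab settings above end up in the JSON definition of the activity. The sketch below builds a dictionary shaped like that definition and prints it as JSON. It is an illustration under assumptions: the property names (`storeSettings`, `wildcardFileName`, `modifiedDatetimeStart`, and so on) and the `AzureBlobStorageReadSettings` type follow the Delete Activity format as I understand it and should be checked against the JSON that ADF Studio generates, while `build_delete_activity`, the activity name, and the dataset name are hypothetical.

```python
import json


def build_delete_activity(dataset_name, wildcard_file_name=None, recursive=True,
                          modified_start=None, modified_end=None,
                          max_concurrent_connections=None):
    """Build a dict shaped like a Delete activity definition (names assumed)."""
    # Store settings type assumed for an Azure Blob Storage dataset.
    store_settings = {"type": "AzureBlobStorageReadSettings", "recursive": recursive}
    optional = {
        "wildcardFileName": wildcard_file_name,
        "modifiedDatetimeStart": modified_start,
        "modifiedDatetimeEnd": modified_end,
        "maxConcurrentConnections": max_concurrent_connections,
    }
    # Emit only the optional settings that were actually provided.
    store_settings.update({k: v for k, v in optional.items() if v is not None})
    return {
        "name": "DeleteBlobFiles",  # hypothetical activity name
        "type": "Delete",
        "typeProperties": {
            "dataset": {"referenceName": dataset_name, "type": "DatasetReference"},
            "storeSettings": store_settings,
        },
    }


# Hypothetical dataset name; deletes CSV files in the dataset folder and subfolders.
activity = build_delete_activity("BlobFolderDataset", wildcard_file_name="*.csv")
print(json.dumps(activity, indent=2))
```

Leaving out unset options mirrors the advice above to specify a value, such as the connection limit, only when you actually want to restrict it.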
ADF Delete Activity - Advanced Options
- Dynamic Content: You can use dynamic content in the Delete
Activity to make its operation more flexible. For instance, you can dynamically
set the folder path or file name to be deleted based on pipeline parameters or
activity outputs. This is particularly useful in scenarios where the data to be
deleted varies from execution to execution.
- Dependency Conditions: In a complex pipeline, the Delete Activity
might depend on the successful completion of previous activities. ADF allows
you to configure dependency conditions to ensure the Delete Activity only runs
after certain conditions are met, enhancing your pipeline's robustness.
- Logging and Monitoring: Configure diagnostic settings for your Data
Factory to capture detailed logs of your Delete Activity executions. Monitoring
these logs can help you audit data deletions and troubleshoot issues if the
activity does not behave as expected.
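The three options above can appear together in one activity definition. The sketch below is illustrative only and assumes the usual pipeline JSON shapes: a `dependsOn` entry with a `Succeeded` condition, a dataset parameter filled from a pipeline parameter through an expression, and the activity-level `enableLogging` setting, which is separate from the factory-level diagnostic settings. All names (`CopyToStaging`, `ParameterizedBlobDataset`, `folderPath`, `LogStorageLinkedService`, the log path) are hypothetical.

```python
import json

delete_activity = {
    "name": "DeleteProcessedFiles",  # hypothetical activity name
    "type": "Delete",
    # Run only after the (hypothetical) upstream copy activity succeeds.
    "dependsOn": [
        {"activity": "CopyToStaging", "dependencyConditions": ["Succeeded"]}
    ],
    "typeProperties": {
        "dataset": {
            "referenceName": "ParameterizedBlobDataset",
            "type": "DatasetReference",
            # Dynamic content: the folder to delete comes from a pipeline parameter.
            "parameters": {
                "folderPath": {
                    "value": "@pipeline().parameters.folderPath",
                    "type": "Expression",
                }
            },
        },
        "storeSettings": {"type": "AzureBlobStorageReadSettings", "recursive": True},
        # Activity-level logging of deleted file names (assumed property names).
        "enableLogging": True,
        "logStorageSettings": {
            "linkedServiceName": {
                "referenceName": "LogStorageLinkedService",
                "type": "LinkedServiceReference",
            },
            "path": "delete-activity-logs",
        },
    },
}

print(json.dumps(delete_activity, indent=2))
```

Tying the delete to a `Succeeded` condition means a failed upstream step leaves the source files in place, which makes a rerun possible.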
Usage Scenarios of ADF Delete Activity
- Cleanup Temporary Files: After processing
data, use the Delete Activity to remove temporary files that were generated
during the process.
- Maintain Folder Size: Regularly delete
old files based on the retention policy to prevent storage from growing
uncontrollably.
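The retention scenario comes down to a cutoff time combined with the half-open range [Start time, End time) of the last-modified filter. The sketch below reproduces that selection locally, assuming the cutoff is used as the exclusive End time; `files_past_retention` and the sample timestamps are hypothetical. In a pipeline, the cutoff would more likely come from an expression such as `@addDays(utcNow(), -30)`, which is likewise an assumption to verify against your factory.

```python
from datetime import datetime, timedelta, timezone


def files_past_retention(files, retention_days, now):
    """Return sorted names whose last-modified time is before the cutoff.

    files maps a file name to its last-modified datetime. The cutoff acts
    as an exclusive End time, so a file modified exactly at the cutoff is kept.
    """
    cutoff = now - timedelta(days=retention_days)
    return sorted(name for name, modified in files.items() if modified < cutoff)


utc = timezone.utc
now = datetime(2024, 3, 31, tzinfo=utc)  # fixed "now" so the example is repeatable

# Hypothetical last-modified times; with 30 days of retention the cutoff is 2024-03-01.
files = {
    "old.csv": datetime(2024, 1, 15, tzinfo=utc),
    "boundary.csv": datetime(2024, 3, 1, tzinfo=utc),
    "recent.csv": datetime(2024, 3, 20, tzinfo=utc),
}

print(files_past_retention(files, retention_days=30, now=now))
```

Only `old.csv` falls before the cutoff here; the file modified exactly at the cutoff stays because the end of the range is exclusive.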