In Azure Data Factory
(ADF), obtaining the row count from a data source involves using a combination
of activities within a pipeline. The most common method is to use a Lookup
activity to execute a query that counts the rows in the dataset.
Create a Dataset
- Open Azure Data Factory Studio: Go to your Data
Factory instance and open the ADF Studio.
- Note: If we use Lookup
activity on top of a data set, we’ll directly get the number of rows as an
output of Lookup activity BUT lookup only supports 5000 records, that’s why we
have created this example to get the count of the rows even if it is more than
5000 rows.
- Create a Dataset: In the “Author” tab,
create a dataset that points to the data source (such as a table in Azure SQL
Database, Azure Blob Storage, etc.) for which you want to count the rows. Here, we are going to store excel file in
Azure BLOB storage and consider this file as our data source. We are going
to count the rows from this excel file.
- In the Parameters section, define this
parameter- VarStartFinal. You can specify the default value as shown.
A2:A3 is the range of the excel file that means in column - A, we
are going to consider from row 2 to 3 (This is just a default value so
don’t worry).
- For the same dataset, go to Connection
tab and click on Add dynamic content in Range field like this.
- The expression builder window will open.
Click on Parameters >> VarStartFinal (the variable we defined earlier).
Create a Pipeline
- Still
in the “Author” tab, create a new pipeline.
- Put
all these variables in pipeline.
- Now,
put an Until action in the pipeline. For Until action, click on Settings
tab and put this dynamic content in Expression.
@equals(variables('CurrentCount'),0) - Basically,
we are going to perform all the activities within Until action repetitively,
till the condition of Until action is fulfilled.
- We
are going to add four actions within Until action.
- Lookup: Click on the Settings tab for Lookup
activity. Select Source dataset and specify the dataset property as shown.
@concat('A',variables('StartNumber'),':','A',variables('EndNumber')
- Set Variable: Now, click on the Set Variable action
>> Settings tab. We are setting
the value for the variable – CurrentCount. Paste this value in Value
field.
@activity('Lookup3').output.count
- Set Variable: We are going to set the second
variable. We will set the variable – CurrentTotalCount
@add(activity('Lookup3').output.count,variables('TotalCount'))
- If Condition: Now, within If condition, we have 3
more actions for set variable but first let’s look at the If condition. Put this
for If condition expression.
@greater(activity('Lookup3').output.count, 0)
- Now
let’s look at the 3 activities within If condition. All these 3 activities are
in True statement of the If condition. We are keeping False (Else) part empty.
- Set Variable: Set the first variable – TotalCount
with below formula.
@variables('CurrentTotalCount')
- Set Variable: Set the second variable – StartNumber
with below formula.
@string(add(variables('TotalCount'),1))
- Set Variable: Set the third variable – EndNumber
with below formula.
@string(add(variables('TotalCount'),4999))
- Configure the Lookup Activity:
- Link
the activity to the dataset you created.
- In
the settings of the Lookup activity, you will need to provide a query that
returns the row count. For example, if your dataset is a SQL database, your
query will be something like SELECT COUNT(*) AS RowCount FROM
YourTableName.
- After
you run the pipeline, when you check the variable – EndNumber value, you
will get the total number of rows in the excel file.
- Debug: Test the pipeline using the Debug feature to make sure it correctly
retrieves the row count.
- Publish: Once the pipeline works as expected, publish your changes.
Additional Tips
- Error Handling: Implement error handling in the Lookup activity to manage scenarios
like empty results or connectivity issues.
- Performance Considerations: If the dataset is very
large, consider the performance impact of running a full count query. In some
cases, it may be beneficial to store and update a row count in a separate control
table, especially for very large datasets.
- Incremental Loads: For incremental loading scenarios, consider
maintaining a row count as part of your incremental load metadata to avoid
having to count the entire dataset each time.