In the modern cloud world, moving and transforming data is an everyday necessity. Whether you need to load data from multiple sources into a central data warehouse, synchronize databases or prepare data for analytics workloads, a reliable data integration tool is essential. That’s where Azure Data Factory (ADF) comes in: Microsoft’s powerful and scalable cloud ETL and data integration service.
Are you new to ADF and wondering how to get started? Don’t worry! This Azure Data Factory tutorial provides you with a practical ADF introduction and guides you step-by-step through the creation of your very first data pipeline. By the end, you’ll understand how to automate simple data copy tasks in Azure.
Prerequisite: You need an active Azure subscription. If you don’t have one yet, you can start with a free trial account.
What is Azure Data Factory (brief overview)?
Azure Data Factory is a fully managed, serverless data integration service. Think of it as an orchestration platform that allows you to create, schedule and monitor data flows (called pipelines). These pipelines can retrieve data from a variety of sources (on-premises or in the cloud), transform it and load it into different destinations.
The core components of ADF are:
- Pipelines: A logical grouping of activities that together perform a task.
- Activities: Individual processing steps in a pipeline (e.g. copy data, execute stored procedure, execute data flow).
- Linked Services: Define the connection information to external resources (e.g. databases, file storage, cloud services), much like a connection string.
- Datasets: Represent the structure and location of the data within a data store (e.g. a specific table, a file, a folder).
- Triggers: Define when a pipeline should be executed (manually, by schedule, event-based).
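If you prefer working in code, the same building blocks are exposed as model classes in the Python SDK (the `azure-mgmt-datafactory` package). The minimal sketch below only maps the concepts above to their SDK counterparts; the later steps of this tutorial show how they are put to use.

```python
# Conceptual mapping only - everything you click together in ADF Studio
# can also be defined in code with the azure-mgmt-datafactory package.
from azure.mgmt.datafactory.models import (
    PipelineResource,       # Pipeline: groups activities into one unit of work
    CopyActivity,           # Activity: a single processing step (here: copy data)
    LinkedServiceResource,  # Linked service: connection info for an external resource
    DatasetResource,        # Dataset: structure/location of data inside a data store
    ScheduleTrigger,        # Trigger: defines when a pipeline should run
)
```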
Scenario for our tutorial
To understand the basics, let’s set up a very simple but common task: copying a file from one location in Azure Blob Storage to another folder in the same or a different Blob Storage account.
Step-by-step guide: Your first ADF pipeline
Follow these steps to create your first pipeline:
Step 1: Create Azure Data Factory
- Log in to the Azure Portal.
- Click on “+ Create resource”.
- Search for “Data Factory” and select the service.
- Click on “Create”.
- Fill in the required fields:
  - Subscription: Select your Azure subscription.
  - Resource group: Select an existing one or create a new one (e.g. `rg-adf-tutorial`).
  - Region: Select a region near you (e.g. “West Europe”).
  - Name: Enter a globally unique name for your data factory (e.g. `adf-ailio-tutorial`).
  - Version: Make sure that `V2` is selected.
  - (Optional) Configure Git integration, networking and tags as required (for this tutorial you can leave the default settings).
- Click on “Review + create” and then on “Create”. The deployment will take a few minutes.
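If you prefer scripting over portal clicks, the same deployment can be sketched with the Python SDK (`azure-identity` and `azure-mgmt-datafactory`). This is a minimal sketch, assuming the resource group already exists and that the placeholder subscription ID is replaced with your own; the resource names match the examples above.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

subscription_id = "<your-subscription-id>"   # placeholder
rg_name = "rg-adf-tutorial"                  # existing resource group
df_name = "adf-ailio-tutorial"               # globally unique factory name

# Authenticate (e.g. after "az login") and create the management client.
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Create (or update) the data factory in the chosen region.
df = adf_client.factories.create_or_update(rg_name, df_name, Factory(location="westeurope"))
print(df.name, df.provisioning_state)
```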
Step 2: Start ADF Studio
- Once the deployment is complete, navigate to your newly created Data Factory resource.
- On the overview page, click on the “Open” tile under “Open Azure Data Factory Studio”. This opens the visual development environment in a new tab.
Step 3: Create linked services
We need connections to our source and target storage accounts.
- In ADF Studio, click on the wrench icon (“Manage”) on the left.
- Go to “Linked services” and click on “+ New”.
- Search for “Azure Blob Storage” and select it. Click on “Next”.
- Configure the source linked service:
  - Name: Enter a name (e.g. `ls_blob_source`).
  - Integration Runtime: Leave it at `AutoResolveIntegrationRuntime`.
  - Authentication method: Select a suitable method (e.g. “Account key”, or “System-assigned managed identity” if ADF has access to the storage account).
  - Azure subscription & storage account name: Select your subscription and the storage account where your source file is located.
  - Click on “Test connection” to make sure that everything is working.
  - Click on “Create”.
- Repeat steps 3 and 4 to create a target linked service (e.g. `ls_blob_sink`) that points to the target storage account (this can be the same account).
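The same two linked services can also be created programmatically. The sketch below reuses `adf_client`, `rg_name` and `df_name` from the step 1 sketch, uses account-key authentication via a placeholder connection string, and registers the sink against the same storage account purely for illustration.

```python
from azure.mgmt.datafactory.models import (
    AzureBlobStorageLinkedService, LinkedServiceResource, SecureString
)

# Placeholder connection string (account key authentication) - replace with your own.
conn_str = SecureString(
    value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
)

ls_blob = LinkedServiceResource(
    properties=AzureBlobStorageLinkedService(connection_string=conn_str)
)

# Source and sink linked services; here both point at the same storage account.
adf_client.linked_services.create_or_update(rg_name, df_name, "ls_blob_source", ls_blob)
adf_client.linked_services.create_or_update(rg_name, df_name, "ls_blob_sink", ls_blob)
```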
Step 4: Create datasets
Now we define which data (folders/files) we want to access.
- Click on the pencil icon (“Author”) on the left in ADF Studio.
- Move the mouse pointer over “Datasets” and click on the three dots (…), then on “+ New dataset”.
- Select “Azure Blob Storage” as the data storage and click “Next”.
- Select the format of your file (e.g. “Binary” for a 1:1 copy or “DelimitedText” for CSV). For this example, we will use “Binary”. Click on “Next”.
- Configure the source dataset:
  - Name: Enter a name (e.g. `ds_blob_source_file`).
  - Linked service: Select the previously created `ls_blob_source`.
  - File path: Navigate to the container and, if applicable, the folder/file that you want to copy. Leave the file name blank if you want to copy an entire folder, or enter the specific file name.
  - Click on “OK”.
- Repeat steps 2-5 to create a target dataset (e.g. `ds_blob_sink_folder`). Select `ls_blob_sink` as the linked service and enter the target container and, if applicable, the target folder. Leave the file name empty, as the copy activity takes it from the source.
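Defined in code, the two Binary datasets could look like the sketch below (again reusing the client and names from the earlier sketches). The container and folder names (`source`, `input`, `target`, `output`) and the file name `demo.csv` are made-up examples; adjust them to your own storage layout.

```python
from azure.mgmt.datafactory.models import (
    AzureBlobStorageLocation, BinaryDataset, DatasetResource, LinkedServiceReference
)

src_ls = LinkedServiceReference(type="LinkedServiceReference", reference_name="ls_blob_source")
snk_ls = LinkedServiceReference(type="LinkedServiceReference", reference_name="ls_blob_sink")

# Source dataset: one specific file (example path - adjust to your storage layout).
ds_source = DatasetResource(properties=BinaryDataset(
    linked_service_name=src_ls,
    location=AzureBlobStorageLocation(container="source", folder_path="input", file_name="demo.csv"),
))
adf_client.datasets.create_or_update(rg_name, df_name, "ds_blob_source_file", ds_source)

# Sink dataset: only container/folder - the copy activity keeps the source file name.
ds_sink = DatasetResource(properties=BinaryDataset(
    linked_service_name=snk_ls,
    location=AzureBlobStorageLocation(container="target", folder_path="output"),
))
adf_client.datasets.create_or_update(rg_name, df_name, "ds_blob_sink_folder", ds_sink)
```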
Step 5: Create pipeline
- Move the mouse pointer over “Pipelines” and click on the three dots (…), then on “+ New Pipeline”.
- Give your pipeline a name in the properties area on the right (e.g. `pl_copy_blob_to_blob`).
- Expand the “Move and transform” section in the “Activities” area.
- Drag and drop the “Copy data” activity onto the empty pipeline canvas.
Step 6: Configure Copy Data activity
- Click on the “Copy data” activity on the canvas.
- Go to the “Source” tab in the lower area.
- Select your source dataset (`ds_blob_source_file`) from the drop-down list.
- Go to the “Sink” tab.
- Select your target dataset (`ds_blob_sink_folder`).
- (Optional) Explore the “Mapping” tab (to map columns, relevant for structured data) and the “Settings” tab (for timeouts, retries, etc.). The default settings are sufficient for our simple example.
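As a code sketch, the whole pipeline from steps 5 and 6 boils down to a single Copy activity wired between the two datasets. The Binary source/sink types match the Binary datasets above; the activity name is arbitrary.

```python
from azure.mgmt.datafactory.models import (
    BinarySink, BinarySource, CopyActivity, DatasetReference, PipelineResource
)

copy_step = CopyActivity(
    name="CopyBlobToBlob",  # arbitrary activity name
    inputs=[DatasetReference(type="DatasetReference", reference_name="ds_blob_source_file")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="ds_blob_sink_folder")],
    source=BinarySource(),
    sink=BinarySink(),
)

pipeline = PipelineResource(activities=[copy_step])
adf_client.pipelines.create_or_update(rg_name, df_name, "pl_copy_blob_to_blob", pipeline)
```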
Step 7: Debug/execute pipeline
- Click on “Debug” above the pipeline canvas. This executes the pipeline immediately without publishing it (ideal for testing).
- Switch to the “Output” tab in the lower area. Here you can see the progress and the result of the debug run (status: “In progress”, “Succeeded” or “Failed”).
- (Optional) To save your pipeline permanently, click on “Publish all” at the top. To run it daily, for example, you could add a “Trigger” (“+ New/Edit” under Triggers).
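Outside of ADF Studio, a published pipeline can also be started and monitored through the SDK. A quick sketch (note that this runs the deployed pipeline, not a debug run, and reuses the client from the earlier sketches):

```python
import time

# Start a pipeline run - the API equivalent of a manual trigger.
run = adf_client.pipelines.create_run(rg_name, df_name, "pl_copy_blob_to_blob")

# Give the run a moment, then check its status (poll until it leaves "InProgress").
time.sleep(30)
result = adf_client.pipeline_runs.get(rg_name, df_name, run.run_id)
print(result.status)  # e.g. "Succeeded" or "Failed"
```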
Step 8: Check result
Navigate to your target blob storage container in the Azure Portal or with the Azure Storage Explorer. You should now find the copied file there!
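If you want to verify the result from code instead of the portal, a short sketch with the `azure-storage-blob` package could look like this (connection string, container and folder names are the same placeholders used above):

```python
from azure.storage.blob import ContainerClient

# Placeholder connection string of the target storage account and the sink container.
container = ContainerClient.from_connection_string(
    "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>",
    container_name="target",
)

# List the blobs in the output folder - the copied file should show up here.
for blob in container.list_blobs(name_starts_with="output/"):
    print(blob.name)
```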
Summary & next steps
Congratulations! You have just created and successfully executed your first Azure Data Factory pipeline. You have learned how to:
- Create a Data Factory instance.
- Establish connections to data stores (linked services).
- Define data structures (datasets).
- Create and configure a pipeline with a copy activity.
- Test your pipeline (debugging).
Of course, this is just the beginning. Azure Data Factory offers a huge range of connectors and activities, including complex data transformations with mapping data flows, code execution (Azure Functions, Databricks Notebooks), control flow logic and much more.
Would you like to delve deeper or do you need support with more complex data integration scenarios in Azure?
Ailio is your experienced partner for Azure Data Engineering and Data Science. Contact us to find out how we can optimize your data integration!