
Copy Activity in Azure Data Factory and Azure Synapse Analytics

Published on Oct 14, 2024

Sunita Mallick
Experienced tech content writer passionate about creating clear and helpful content for learners. In my free time, I love exploring the latest technology.

Azure Data Factory (ADF) and Azure Synapse Analytics are two of the key tools for data integration and data transformation on Azure. Both services share the Copy activity, which moves data between different systems and formats.

This activity is critical for migrating data, bridging cloud and on-premises deployments, and getting data ready for analytics. In this comprehensive tutorial, we give a detailed explanation of the Copy activity, with special attention to supported data stores, file formats, and configuration options.

Supported Data Stores and Formats

Azure Data Factory and Azure Synapse Analytics support a vast array of data stores for the Copy activity. 

These include:

  • Azure Services: Azure Blob Storage, Azure Data Lake Storage Gen1 and Gen2, Azure SQL Database, and Azure Synapse Analytics are fully supported, so copying volumes of data from one Azure service to another is very easy.
  • Databases: The most widely used relational database platforms, such as SQL Server, Oracle, MySQL, and PostgreSQL, are supported as both source and sink. Cloud-hosted databases, such as Amazon RDS for Oracle and SQL Server and Google BigQuery, are also integrated, to name but a few.
  • NoSQL Stores: NoSQL databases such as Cassandra and MongoDB (including MongoDB Atlas) are supported as source systems, making it easy to integrate unstructured data.
  • File Systems: Data from several file systems, including FTP, SFTP, and HDFS, and from cloud storage such as Amazon S3 and Google Cloud Storage, can be ingested into Azure.
  • Business Applications: CRM systems such as Salesforce and Dynamics are natively supported, as are other business apps like ServiceNow and SharePoint Online, so data can be imported directly from enterprise applications.

This broad list of supported data stores means that you can connect to data from nearly any source and pull it into Azure for further processing.

Supported File Formats

The Copy activity in Azure Data Factory supports a diverse set of file formats, making it flexible for various data scenarios:

  • Avro, ORC, and Parquet: These columnar formats are typical in big data workloads because of their efficient storage and data retrieval.
  • JSON and XML: These structured formats are common in web services and applications and are widely used for transferring data between systems.
  • Delimited Text (CSV): A simple, widely adopted format for data exchange, especially for flat structures and legacy systems.
  • Excel: Imports and exports in Excel format are supported, which matters when data must be integrated with business analysis tools.
  • Binary Format: For raw binary data, applicable where the data has no fixed structure.

This range of formats helps the Copy activity integrate with your existing systems, whatever data types they use.

Supported Regions

Azure Data Factory and Synapse Analytics are available in almost every Azure geography. When setting up your Copy activity, make sure that your services and data stores are in the same or a nearby region to avoid unnecessary latency.

Configuration

Configuring the Copy activity involves several steps:

  • Create Linked Services: Linked services define the connection information for your data sources and sinks. For instance, if you are copying data from Azure SQL Database to Blob Storage, both will require linked services.
  • Create Datasets: Datasets describe the data you wish to transfer. You create datasets for both source and sink, where you set parameters such as file paths, tables, or queries.

  • Configure the Copy Activity: Once you have defined linked services and datasets, you define the Copy activity itself, completing the pipeline. Here you set attributes such as data integration units, parallel copies, and fault tolerance.
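As an illustrative sketch of the first two steps, here are a linked service and a dataset for an Azure SQL Database (two separate resource definitions; the names, connection string placeholder, and table are assumptions for this example):

```json
{
    "name": "AzureSqlLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "<your-connection-string>"
        }
    }
}
```

```json
{
    "name": "OutputDataset",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": {
            "referenceName": "AzureSqlLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "tableName": "dbo.SalesData"
        }
    }
}
```

The dataset references the linked service by name, which is how the Copy activity later resolves where the table actually lives.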

Syntax

The Copy activity syntax in Azure Data Factory and Synapse pipelines usually includes source and sink attributes, along with other optional parameters.

Below is a simple example:

{
    "name": "CopyFromBlobToSql",
    "type": "Copy",
    "inputs": [
        {
            "referenceName": "InputDataset",
            "type": "DatasetReference"
        }
    ],
    "outputs": [
        {
            "referenceName": "OutputDataset",
            "type": "DatasetReference"
        }
    ],
    "typeProperties": {
        "source": {
            "type": "BlobSource"
        },
        "sink": {
            "type": "SqlSink"
        }
    }
}

Syntax Details

In the above syntax:

  • Inputs and Outputs refer to the datasets (source and destination).
  • Source defines the data source (e.g., BlobSource).
  • Sink defines the data destination (e.g., SqlSink).

You can further configure additional properties, such as fault tolerance and logging options.

Monitoring

Azure provides robust monitoring features for tracking the progress and performance of your Copy activities. You can view pipeline run histories, monitor data movement in real-time, and set up alerts for failures or performance issues. This ensures that you can troubleshoot and optimize your data integration processes effectively.

Incremental Copy

Incremental copy is a feature that allows you to transfer only the data that has changed since the last run, rather than copying the entire dataset every time. This is particularly useful for large datasets where only a small portion of the data changes regularly. Incremental copy reduces the amount of data transferred, thereby improving performance and reducing costs.
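A common way to implement this is a watermark pattern: store the high-water mark (for example, the latest modification timestamp) from the previous run, then filter the source query on it. A sketch of the source side of a Copy activity, assuming a hypothetical `dbo.Orders` table, `LastModifiedDate` column, and `LastWatermark` pipeline parameter:

```json
"source": {
    "type": "AzureSqlSource",
    "sqlReaderQuery": "SELECT * FROM dbo.Orders WHERE LastModifiedDate > '@{pipeline().parameters.LastWatermark}'"
}
```

After each successful run, the pipeline would update the stored watermark so the next run only picks up newer rows.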

Performance and Tuning

Performance can be optimized in several ways:

  • Parallel Copy: Configure the Copy activity to use parallelism, which enables simultaneous reading from the source and writing to the sink, reducing overall execution time.
  • Staging Data: For large datasets, stage data in Azure Blob Storage before moving it to the final destination. This reduces the load on the source system and speeds up data transfer.
  • Data Consistency: After data is copied, verify consistency between the source and destination to ensure data integrity.
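These tuning knobs map to properties on the Copy activity itself. A sketch combining them (the staging linked service name and path are assumptions for this example):

```json
"typeProperties": {
    "source": { "type": "BlobSource" },
    "sink": { "type": "SqlSink" },
    "parallelCopies": 8,
    "dataIntegrationUnits": 16,
    "enableStaging": true,
    "stagingSettings": {
        "linkedServiceName": {
            "referenceName": "StagingBlobLinkedService",
            "type": "LinkedServiceReference"
        },
        "path": "staging"
    }
}
```

Leaving `parallelCopies` and `dataIntegrationUnits` unset lets the service choose values automatically, which is a reasonable starting point before tuning.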

Resume from Last Failed Run

In case of failure during data transfer, Azure Data Factory and Synapse Analytics allow you to resume the Copy activity from the last failed run, rather than starting over. This saves time and ensures data continuity.

Preserve Metadata Along with Data

When copying data, you can also choose to preserve metadata such as column names, data types, and file properties. This ensures that the data remains consistent and usable after transfer.
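For binary copies between file stores, metadata preservation is requested through the `preserve` setting on the activity, for example:

```json
"typeProperties": {
    "source": { "type": "BinarySource" },
    "sink": { "type": "BinarySink" },
    "preserve": [ "Attributes" ]
}
```

This sketch keeps file attributes and custom metadata from the source files; the exact set of preservable items depends on the source and sink store types.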

Add Metadata Tags to File-Based Sink

For file-based sinks, you can add metadata tags to the files during the copy process. These tags can include information like the source of the data, the date of transfer, and other custom tags that help in data management and organization.
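On a supported file-based sink, tags are specified as name/value pairs under `metadata`. A sketch with assumed tag names:

```json
"sink": {
    "type": "AzureBlobFSSink",
    "metadata": [
        { "name": "sourceSystem", "value": "SalesDB" },
        { "name": "loadBatch", "value": "nightly-load" }
    ]
}
```

Each copied file then carries these tags as blob metadata, which downstream processes can use for filtering and lineage.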

Schema and Data Type Mapping

Azure Data Factory supports schema and data type mapping between source and destination. This allows for seamless data transformation, ensuring that data types are compatible between different systems.
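Explicit column mappings are declared in the activity's `translator` property. A sketch mapping two assumed source columns onto differently named sink columns:

```json
"translator": {
    "type": "TabularTranslator",
    "mappings": [
        { "source": { "name": "Id" },   "sink": { "name": "CustomerID" } },
        { "source": { "name": "Name" }, "sink": { "name": "LastName" } }
    ]
}
```

If no translator is specified, the service falls back to mapping columns by name, case-insensitively.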

Add Additional Columns During Copy

You can add additional columns to your data during the copy process. This can be useful for adding metadata or calculated fields to the data as it moves between systems.
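Additional columns are declared on the source; values can be static strings, dynamic expressions, or reserved variables such as `$$FILEPATH`. A sketch with assumed column names:

```json
"source": {
    "type": "DelimitedTextSource",
    "additionalColumns": [
        { "name": "sourceFilePath", "value": "$$FILEPATH" },
        { "name": "batchName", "value": "nightly-load" }
    ]
}
```

This is a convenient way to stamp each row with its originating file when ingesting folders of files.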

Auto Create Sink Tables

The Copy activity can automatically create tables in the sink destination if they do not already exist. This is particularly useful when integrating with new or dynamic data sources.
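For SQL-family sinks, this behavior is enabled with the `tableOption` setting:

```json
"sink": {
    "type": "AzureSqlSink",
    "tableOption": "autoCreate"
}
```

The sink table schema is derived from the source data, so it is worth reviewing the generated column types before relying on auto-created tables in production.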

Fault Tolerance

Fault tolerance settings allow the copy activity in azure data factory to continue running even if some rows fail to copy. You can configure the activity to skip failed rows, log the errors, and continue with the rest of the data transfer.
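Skipping incompatible rows and redirecting them to a log location looks roughly like this (the linked service name and path are assumptions for this example):

```json
"typeProperties": {
    "enableSkipIncompatibleRow": true,
    "redirectIncompatibleRowSettings": {
        "linkedServiceName": {
            "referenceName": "ErrorLogStorage",
            "type": "LinkedServiceReference"
        },
        "path": "copy-errors"
    }
}
```

Skipped rows are written to the redirect path so they can be inspected and reprocessed later.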

Data Consistency Verification

After the Copy activity completes, you can verify data consistency between the source and the sink. This ensures that all data has been transferred correctly and that there are no discrepancies.
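Verification is switched on with a single flag on the activity:

```json
"typeProperties": {
    "validateDataConsistency": true
}
```

Note that enabling verification adds checks such as file size and checksum comparison for binary copies, which can lengthen the run; pair it with session logging to record any inconsistent files found.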

Session Log

Session logs provide detailed information about the data transfer process, including the number of rows copied, any errors encountered, and the overall performance of the activity. These logs are essential for monitoring and troubleshooting your data integration processes.
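Session logging is configured under `logSettings`; a sketch writing logs to an assumed storage account and folder:

```json
"typeProperties": {
    "logSettings": {
        "enableCopyActivityLog": true,
        "copyActivityLogSettings": {
            "logLevel": "Warning",
            "enableReliableLogging": false
        },
        "logLocationSettings": {
            "linkedServiceName": {
                "referenceName": "LogStorageLinkedService",
                "type": "LinkedServiceReference"
            },
            "path": "copy-logs"
        }
    }
}
```

A `logLevel` of `Warning` records only skipped rows and errors, while `Info` also lists every file copied.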

For more in-depth knowledge and hands-on experience with Azure Data Factory, consider enrolling in an Azure Data Engineering course online. Such a course covers data movement, transformation, and orchestration in detail, equipping you with the skills to manage complex data engineering tasks on Azure.

FAQs

How to improve copy activity performance in Azure Data Factory?

Enable parallel copies, increase Data Integration Units, fine-tune source queries, and stage large loads in Azure Blob Storage to reduce data transfer time.

Which 3 types of activities can you run in Microsoft Azure Data Factory?

Data movement activities (such as Copy), data transformation activities (such as Data Flow), and control activities (such as Execute Pipeline).

How do I copy multiple files in Azure Data Factory?

Use wildcard paths, turn on recursive copy in the dataset, or use the ForEach activity to loop through multiple files.

Which Azure Data Factory Integration runtime would be used in a data copy activity?

Use the Azure IR for cloud-based data, a self-hosted IR for on-premises environments, and the Azure-SSIS IR for running SSIS packages.
