
Copy Activity in Azure Data Factory and Azure Synapse Analytics

Published on Oct 14, 2024

Sunita Mallick
Experienced tech content writer passionate about creating clear and helpful content for learners. In my free time, I love exploring the latest technology.

Azure Data Factory (ADF) and Azure Synapse Analytics are two of the key tools for data integration and data transformation on Azure. Both services share the Copy activity, which moves data between different systems and formats.

This activity is critical for migrating data, bridging cloud and on-premises deployments, and getting data ready for analytics. In this comprehensive tutorial, we give a detailed explanation of the Copy activity, with special attention to supported data stores, file formats, and configuration options.

Supported Data Stores and Formats

Azure Data Factory and Azure Synapse Analytics support a vast array of data stores for the Copy activity. 

These include:

  • Azure Services: Azure Blob Storage, Azure Data Lake Storage Gen1 and Gen2, Azure SQL Database, and Azure Synapse Analytics are fully supported, so copying volumes of data from one Azure service to another is very easy.
  • Databases: The most widely used relational database platforms, such as SQL Server, Oracle, MySQL, and PostgreSQL, are supported as both source and sink. Cloud-hosted databases, such as Amazon RDS for Oracle and SQL Server and Google BigQuery, are also integrated, to name but a few.
  • NoSQL Stores: NoSQL databases such as Cassandra and MongoDB (including MongoDB Atlas) are supported as source systems, making it easy to integrate unstructured data.
  • File Systems: Data from several file systems, including FTP, SFTP, and HDFS, and from cloud storage such as Amazon S3 and Google Cloud Storage, can be ingested into Azure.
  • Business Applications: CRM systems such as Salesforce and Dynamics are natively supported, as are other business apps like ServiceNow and SharePoint Online, so data can be imported directly from enterprise applications.

This broad list of supported data stores means that you can connect to data from nearly any source and pull it into Azure for further processing.

Supported File Formats

The Copy activity in Azure Data Factory supports a diverse set of file formats, making it flexible for various data scenarios:

  • Avro, ORC, and Parquet: These columnar formats are typical in big data workloads because of their efficient storage and data retrieval.
  • JSON and XML: These structured formats are common in web services and applications and are widely used for transferring data between systems.
  • Delimited Text (CSV): A simple, widely adopted format for data exchange, especially for flat structures and legacy systems.
  • Excel: Imports and exports in Excel format are supported, which matters when data must be integrated with business analysis tools.
  • Binary Format: For raw binary data, applicable where the data has no fixed structure.

This range of formats helps the Copy activity integrate with your existing systems, whatever data types they use.

Supported Regions

Azure Data Factory and Synapse Analytics are available in almost every Azure geography. When setting up your Copy activity, make sure that your services and data stores are in the same or a nearby region to avoid unnecessary latency.

Configuration

Configuring the Copy activity involves several steps:

  • Create Linked Services: Linked services define the connection information for your data sources and sinks. For instance, if you are copying data from Azure SQL Database to Blob Storage, both will require linked services.
  • Create Datasets: Datasets describe the data you wish to transfer. You create datasets for both source and sink, where you set parameters such as file paths, tables, or queries.

  • Configure the Copy Activity: Once you have defined linked services and datasets, you define the Copy activity itself, completing the pipeline. Here you set attributes such as data integration units, parallel copies, and fault tolerance.
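As an illustrative sketch of the first two steps, here are a linked service and a dataset for an Azure SQL Database (two separate resource definitions; the names, connection string placeholder, and table are assumptions for this example):

```json
{
    "name": "AzureSqlLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "<your-connection-string>"
        }
    }
}
```

```json
{
    "name": "OutputDataset",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": {
            "referenceName": "AzureSqlLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "tableName": "dbo.SalesData"
        }
    }
}
```

The dataset references the linked service by name, which is how the Copy activity later resolves where the table actually lives.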

Syntax

The Copy activity syntax in Azure Data Factory and Synapse pipelines usually includes source and sink attributes, along with other optional parameters.

Below is a simple example:

{
    "name": "CopyFromBlobToSql",
    "type": "Copy",
    "inputs": [
        {
            "referenceName": "InputDataset",
            "type": "DatasetReference"
        }
    ],
    "outputs": [
        {
            "referenceName": "OutputDataset",
            "type": "DatasetReference"
        }
    ],
    "typeProperties": {
        "source": {
            "type": "BlobSource"
        },
        "sink": {
            "type": "SqlSink"
        }
    }
}

Syntax Details

In the above syntax:

  • Inputs and Outputs refer to the datasets (source and destination).
  • Source defines the data source (e.g., BlobSource).
  • Sink defines the data destination (e.g., SqlSink).

You can further configure additional properties, such as fault tolerance and logging options.

Monitoring

Azure provides robust monitoring features for tracking the progress and performance of your Copy activities. You can view pipeline run histories, monitor data movement in real-time, and set up alerts for failures or performance issues. This ensures that you can troubleshoot and optimize your data integration processes effectively.

Incremental Copy

Incremental copy is a feature that allows you to transfer only the data that has changed since the last run, rather than copying the entire dataset every time. This is particularly useful for large datasets where only a small portion of the data changes regularly. Incremental copy reduces the amount of data transferred, thereby improving performance and reducing costs.
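A common way to implement this is a watermark pattern: store the high-water mark (for example, the latest modification timestamp) from the previous run, then filter the source query on it. A sketch of the source side of a Copy activity, assuming a hypothetical `dbo.Orders` table, `LastModifiedDate` column, and `LastWatermark` pipeline parameter:

```json
"source": {
    "type": "AzureSqlSource",
    "sqlReaderQuery": "SELECT * FROM dbo.Orders WHERE LastModifiedDate > '@{pipeline().parameters.LastWatermark}'"
}
```

After each successful run, the pipeline would update the stored watermark so the next run only picks up newer rows.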

Performance and Tuning

Performance can be optimized in several ways:

  • Parallel Copy: Configure the Copy activity to use parallelism, which enables simultaneous reading from the source and writing to the sink, reducing overall execution time.
  • Staging Data: For large datasets, stage data in Azure Blob Storage before moving it to the final destination. This reduces the load on the source system and speeds up data transfer.
  • Data Consistency: After data is copied, verify consistency between the source and destination to ensure data integrity.
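These tuning knobs map to properties on the Copy activity itself. A sketch combining them (the staging linked service name and path are assumptions for this example):

```json
"typeProperties": {
    "source": { "type": "BlobSource" },
    "sink": { "type": "SqlSink" },
    "parallelCopies": 8,
    "dataIntegrationUnits": 16,
    "enableStaging": true,
    "stagingSettings": {
        "linkedServiceName": {
            "referenceName": "StagingBlobLinkedService",
            "type": "LinkedServiceReference"
        },
        "path": "staging"
    }
}
```

Leaving `parallelCopies` and `dataIntegrationUnits` unset lets the service choose values automatically, which is a reasonable starting point before tuning.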

Resume from Last Failed Run

In case of failure during data transfer, Azure Data Factory and Synapse Analytics allow you to resume the Copy activity from the last failed run, rather than starting over. This saves time and ensures data continuity.

Preserve Metadata Along with Data

When copying data, you can also choose to preserve metadata such as column names, data types, and file properties. This ensures that the data remains consistent and usable after transfer.
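For binary copies between file stores, metadata preservation is requested through the `preserve` setting on the activity, for example:

```json
"typeProperties": {
    "source": { "type": "BinarySource" },
    "sink": { "type": "BinarySink" },
    "preserve": [ "Attributes" ]
}
```

This sketch keeps file attributes and custom metadata from the source files; the exact set of preservable items depends on the source and sink store types.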

Add Metadata Tags to File-Based Sink

For file-based sinks, you can add metadata tags to the files during the copy process. These tags can include information like the source of the data, the date of transfer, and other custom tags that help in data management and organization.
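On a supported file-based sink, tags are specified as name/value pairs under `metadata`. A sketch with assumed tag names:

```json
"sink": {
    "type": "AzureBlobFSSink",
    "metadata": [
        { "name": "sourceSystem", "value": "SalesDB" },
        { "name": "loadBatch", "value": "nightly-load" }
    ]
}
```

Each copied file then carries these tags as blob metadata, which downstream processes can use for filtering and lineage.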

Schema and Data Type Mapping

Azure Data Factory supports schema and data type mapping between source and destination. This allows for seamless data transformation, ensuring that data types are compatible between different systems.
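Explicit column mappings are declared in the activity's `translator` property. A sketch mapping two assumed source columns onto differently named sink columns:

```json
"translator": {
    "type": "TabularTranslator",
    "mappings": [
        { "source": { "name": "Id" },   "sink": { "name": "CustomerID" } },
        { "source": { "name": "Name" }, "sink": { "name": "LastName" } }
    ]
}
```

If no translator is specified, the service falls back to mapping columns by name, case-insensitively.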

Add Additional Columns During Copy

You can add additional columns to your data during the copy process. This can be useful for adding metadata or calculated fields to the data as it moves between systems.
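Additional columns are declared on the source; values can be static strings, dynamic expressions, or reserved variables such as `$$FILEPATH`. A sketch with assumed column names:

```json
"source": {
    "type": "DelimitedTextSource",
    "additionalColumns": [
        { "name": "sourceFilePath", "value": "$$FILEPATH" },
        { "name": "batchName", "value": "nightly-load" }
    ]
}
```

This is a convenient way to stamp each row with its originating file when ingesting folders of files.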

Auto Create Sink Tables

The Copy activity can automatically create tables in the sink destination if they do not already exist. This is particularly useful when integrating with new or dynamic data sources.
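For SQL-family sinks, this behavior is enabled with the `tableOption` setting:

```json
"sink": {
    "type": "AzureSqlSink",
    "tableOption": "autoCreate"
}
```

The sink table schema is derived from the source data, so it is worth reviewing the generated column types before relying on auto-created tables in production.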

Fault Tolerance

Fault tolerance settings allow the copy activity in azure data factory to continue running even if some rows fail to copy. You can configure the activity to skip failed rows, log the errors, and continue with the rest of the data transfer.
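Skipping incompatible rows and redirecting them to a log location looks roughly like this (the linked service name and path are assumptions for this example):

```json
"typeProperties": {
    "enableSkipIncompatibleRow": true,
    "redirectIncompatibleRowSettings": {
        "linkedServiceName": {
            "referenceName": "ErrorLogStorage",
            "type": "LinkedServiceReference"
        },
        "path": "copy-errors"
    }
}
```

Skipped rows are written to the redirect path so they can be inspected and reprocessed later.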

Data Consistency Verification

After the Copy activity completes, you can verify data consistency between the source and the sink. This ensures that all data has been transferred correctly and that there are no discrepancies.
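Verification is switched on with a single flag on the activity:

```json
"typeProperties": {
    "validateDataConsistency": true
}
```

Note that enabling verification adds checks such as file size and checksum comparison for binary copies, which can lengthen the run; pair it with session logging to record any inconsistent files found.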

Session Log

Session logs provide detailed information about the data transfer process, including the number of rows copied, any errors encountered, and the overall performance of the activity. These logs are essential for monitoring and troubleshooting your data integration processes.
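Session logging is configured under `logSettings`; a sketch writing logs to an assumed storage account and folder:

```json
"typeProperties": {
    "logSettings": {
        "enableCopyActivityLog": true,
        "copyActivityLogSettings": {
            "logLevel": "Warning",
            "enableReliableLogging": false
        },
        "logLocationSettings": {
            "linkedServiceName": {
                "referenceName": "LogStorageLinkedService",
                "type": "LinkedServiceReference"
            },
            "path": "copy-logs"
        }
    }
}
```

A `logLevel` of `Warning` records only skipped rows and errors, while `Info` also lists every file copied.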

For more in-depth knowledge and hands-on experience with Azure Data Factory, consider enrolling in an Azure Data Engineering course online. Such a course covers data movement, transformation, and orchestration in detail, equipping you with the skills to manage complex data engineering tasks on Azure.

FAQs

How to improve copy activity performance in Azure Data Factory?

Enable parallel copies, increase Data Integration Units, fine-tune source queries, and stage large loads in Azure Blob Storage to reduce data transfer time.

Which 3 types of activities can you run in Microsoft Azure Data Factory?

Data movement activities (such as Copy), data transformation activities (such as Data Flow), and control activities (such as Execute Pipeline).

How do I copy multiple files in Azure Data Factory?

Use wildcard paths, turn on recursive copy in the dataset, or use the ForEach activity to loop through multiple files.

Which Azure Data Factory Integration runtime would be used in a data copy activity?

Use the Azure IR for cloud-based data, a self-hosted IR for on-premises environments, and the Azure-SSIS IR for running SSIS packages.
