Copying data from On-Premise ODBC to SQL Azure using Azure Data Factory Copy Activity

Azure Data Factory (ADF) is a great cloud product for copying virtually any kind of data: on-premise, in the cloud, or between on-premise and the cloud. It uses user-designed pipelines with activities to transfer data, and everything is configured with JSON templates. However, writing these templates is not easy. As a starter Microsoft have provided the Copy Activity Wizard, which walks you through the creation of a single pipeline. But that is it: you cannot use the wizard to re-edit the pipeline later, so you will need to start editing the JSON templates by hand. Once you do that, you will soon find that you need to switch to Visual Studio for editing, source control and deployment of your templates to multiple environments.

You can download the ADF tools for Visual Studio and create an ADF project. The diagram below shows a typical Visual Studio ADF project that can copy data between an on-premise ODBC data source and SQL Azure for different environments. It also includes an on-premise local MS SQL Server linked service.

After you have provisioned an ADF instance, the first thing to grasp about ADF is that you actually need at least five JSON templates to set up a single copy process:
- Two linked service files, which describe your source and sink data stores, e.g. ODBC, SQL Azure etc.
- Two dataset files (shown as tables in VS), which describe the shape or schema of the data to be copied.
- A single pipeline template with a Copy Activity.
If you use the recommended way of developing ADF in Visual Studio, you will also need an additional config file per environment, as shown above. For a great blog on using VS config files click here.

For on-premise support, you will need to install, register and configure credentials in an ADF Data Management Gateway (DMG). Configuring on-premise credentials can be difficult in a production environment where inbound ports are blocked; the remedy is usually a switch to PowerShell, but that is another story.
The DMG is in fact a Windows service that sits on an on-premise VM and, in normal service, communicates with Azure through outbound connections only. It connects to Azure by subscribing to a Service Bus queue that was set up when the DMG was registered with the ADF instance.

To connect to an ODBC source, you first need to install the appropriate ODBC driver on the same VM that hosts the DMG, and then create a system DSN referencing the driver, which the ADF linked service can use.

The next step is to create the source ODBC linked service template file OdbcLinkedService.json and the sink SQL Azure linked service file AzureSqlLinkedService.json respectively. Note that the properties in these files are mandatory but their values are left empty, as each value will differ per environment: Dev, UAT, Prod etc. At deployment time VS will merge the environment config to produce the final JSON.

{
    "name": "OdbcLinkedService",
    "properties": {
       "description": "",
       "hubName": "",
       "type": "OnPremisesOdbc",
       "typeProperties": {
          "connectionString": "",
          "credential": "",
          "gatewayName": "",
          "userName": "",
          "password": "",
          "authenticationType": "Basic"
        }
    }
}

{
  "name": "AzureSqlLinkedService",
  "properties": {
    "description": "",
    "hubName": "",
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": ""
    }
  }
}

We can then create the VS environment configuration files that store the values that vary by environment. The example below, DevConfig.json, is for deployment to the Dev environment. Note it includes entries for both linked services and for a pipeline, which we will come to later.

{
  "OdbcLinkedService": [
    {
      "name": "$.properties.hubName",
      "value": "AcmeDFDev_hub"
    },
    {
      "name": "$.properties.typeProperties.connectionString",
      "value": "driver={InterSystems ODBC35};server=acmesvr1;database=AcmeDEV"
    },
    {
      "name": "$.properties.typeProperties.gatewayName",
      "value": "AcmeDMGDev"
    },
    {
      "name": "$.properties.typeProperties.userName",
      "value": "Acmedevodbc"
    },
    {
      "name": "$.properties.typeProperties.password",
      "value": "abcdefg"
    },
    {
      "name": "$.properties.typeProperties.authenticationType",
      "value": "Basic"
    }
  ],
  "AzureSqlLinkedService": [
    {
      "name": "$.properties.hubName",
      "value": "AcmeDFDev_hub"
    },
    {
      "name": "$.properties.typeProperties.connectionString",
      "value": "Data Source=tcp:abcdef.database.windows.net;Initial Catalog=dba_Acme;Integrated Security=False;User ID=admin_dev;Password=asdfgbvd;Connect Timeout=30;"
    }
  ],
   
  "CountryPipeline": [
    {
      "name": "$.properties.hubName",
      "value": "AcmeDFDev_hub"
    },
    {
      "name": "$.properties.start",
      "value": "2017-06-22T00:00:00Z"
    },
    {
      "name": "$.properties.end",
      "value": "2099-01-01T00:01:00Z"
    },
    {
      "name": "$.properties.isPaused",
      "value": false
    }
  ]
}

Next, we need a dataset (or table) file that describes the shape of the source data so that ADF can serialize it. Note that external is set to true, because the source data is produced outside of ADF. The example below, dsInputCountry.json, is for copying country data daily at 6pm GMT.

{
  "name": "dsInputCountry",
  "properties": {
    "structure": [
      {
        "name": "Code",
        "type": "String"
      },
      {
        "name": "Name",
        "type": "String"
      },
      {
        "name": "Iso",
        "type": "String"
      }
    ],
    "published": false,
    "type": "RelationalTable",
    "linkedServiceName": "OdbcLinkedService",
    "typeProperties": {
      "tableName": "tblCountry"
    },
    "availability": {
      "frequency": "Day",
      "interval": 1,
      "offset":  "18:00:00"
    },
    "external": true,
    "policy": {}
  }
}

As we want to push the data into SQL Azure, we must also create the sink dataset. This example dataset, dsOutputCountry.json, references an MS SQL user-defined table type, UdtCountry, which we will need to create in the sink database first (see the sketch after the dataset below). When a UdtCountry object is instantiated during the copy it will be named udtCountry, which is why the tableName below is udtCountry. For simplicity we can make the schema of UdtCountry identical to the dsInputCountry schema.

{
  "name": "dsOutputCountry",
  "properties": {
    "published": false,
    "type": "AzureSqlTable",
    "linkedServiceName": "AzureSqlLinkedService",
    "typeProperties": {
      "tableName": "udtCountry"
    },
    "availability": {
      "frequency": "Day",
      "interval": 1,
      "offset": "18:00:00"
    },
    "external": false,
    "policy": {}
  }
}
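
A minimal sketch of that user-defined table type is shown below. The column names mirror the dsInputCountry structure; the data types and lengths are assumptions and should be adjusted to match your source data.

-- Sketch only: the table type that the Copy Activity will populate.
-- Column sizes are assumed; align them with the real tblCountry columns.
CREATE TYPE [dbo].[UdtCountry] AS TABLE
(
    [Code] NVARCHAR(10)  NOT NULL,
    [Name] NVARCHAR(100) NOT NULL,
    [Iso]  NVARCHAR(10)  NOT NULL
);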

Finally we create the pipeline file CountryPipeline.json with a Copy Activity to do the work. This pipeline first deletes the existing rows from the SQL Azure target table dbo.Country. Next it selects the data from the ODBC table tblCountry, instantiates a UdtCountry object on SQL Azure and populates it with the country data. Finally it calls a stored procedure on SQL Azure called uspCountry, supplying udtCountry as an argument, which repopulates the Country table.

{
  "name": "CountryPipeline",
  "properties": {
    "description": "",
    "activities": [
      {
        "type": "Copy",
        "typeProperties": {
          "source": {
            "type": "RelationalSource",
            "query": "select code, name, iso from tblCountry"
          },
          "sink": {
            "type": "SqlSink",
            "sqlWriterStoredProcedureName": "uspCountry",
            "storedProcedureParameters": {},
            "sqlWriterTableType": "udtCountry",
            "sqlWriterCleanupScript": "delete from [dbo].[Country]",
            "writeBatchSize": 0,
            "writeBatchTimeout": "00:00:00"
          }
        },
        "inputs": [
          {
            "name": "dsInputCountry"
          }
        ],
        "outputs": [
          {
            "name": "dsOutputCountry"
          }
        ],
        "policy": {
          "timeout": "1.00:00:00",
          "concurrency": 1,
          "style": "StartOfInterval",
          "retry": 3,
          "longRetry": 0,
          "longRetryInterval": "00:00:00"
        },
        "scheduler": {
          "frequency": "Day",
          "interval": 1,
          "offset": "18:00:00"
        },
        "name": "CopyActivity"
      }
    ],
    "start": "",
    "end": "",
    "isPaused": false,
    "hubName": "",
    "pipelineMode": "Scheduled"
  }
}
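
For completeness, the target table and stored procedure referenced by the SqlSink also need to exist in the SQL Azure database. The sketch below is for illustration only (the Country table definition is an assumption); note that the stored procedure parameter is named udtCountry to match the tableName in dsOutputCountry, and its type matches sqlWriterTableType.

-- Sketch only: target table and stored procedure used by the SqlSink.
CREATE TABLE [dbo].[Country]
(
    [Code] NVARCHAR(10)  NOT NULL,
    [Name] NVARCHAR(100) NOT NULL,
    [Iso]  NVARCHAR(10)  NOT NULL
);
GO

-- The parameter name must match the dataset tableName (udtCountry) and
-- its type must match sqlWriterTableType.
CREATE PROCEDURE [dbo].[uspCountry]
    @udtCountry [dbo].[UdtCountry] READONLY
AS
BEGIN
    -- The sqlWriterCleanupScript has already deleted the old rows, so
    -- simply repopulate the table from the table-valued parameter.
    INSERT INTO [dbo].[Country] ([Code], [Name], [Iso])
    SELECT [Code], [Name], [Iso]
    FROM @udtCountry;
END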

In this post we have extracted data from an ODBC source. ADF does not currently support pushing data back into ODBC. However, a useful trick is to use an on-premise MS SQL Server instance and configure ADF to call a local stored procedure there, supplying a UDT; the stored procedure can then push the data into ODBC using a SQL Server linked server, as sketched below.
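
A minimal sketch of that approach is shown below, assuming a SQL Server linked server named ACME_ODBC has already been configured over the ODBC driver; the linked server name, procedure name and target table here are hypothetical, and the UdtCountry table type would also need to exist on the on-premise server.

-- Sketch only: a local stored procedure, called by ADF on the on-premise
-- SQL Server, that pushes rows out to the ODBC source via a linked server.
-- ACME_ODBC, uspPushCountryToOdbc and tblCountry are assumed names.
CREATE PROCEDURE [dbo].[uspPushCountryToOdbc]
    @udtCountry [dbo].[UdtCountry] READONLY
AS
BEGIN
    -- Four-part naming routes the insert through the linked server to the
    -- underlying ODBC data source. Depending on the ODBC provider you may
    -- need OPENQUERY instead.
    INSERT INTO [ACME_ODBC]...[tblCountry] ([code], [name], [iso])
    SELECT [Code], [Name], [Iso]
    FROM @udtCountry;
END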
