Transform Data In Azure Data Factory Using Python Databricks
Solution 1:
It sounds like you want to transform a large number of single JSON files using Azure Data Factory, but as @KamilNowinski said, that is not supported on Azure right now. However, since you are already using Azure Databricks, writing a simple Python script to do the same thing is easier for you. So a workaround is to use the Azure Storage SDK and the pandas Python package directly, in a few steps on Azure Databricks.
Assuming these JSON files are all in one container of Azure Blob Storage, you first need to list them in the container via list_blob_names and generate their URLs with a SAS token for the pandas read_json function, as in the code below.

from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.blob import ContainerPermissions
from datetime import datetime, timedelta

account_name = '<your account name>'
account_key = '<your account key>'
container_name = '<your container name>'

service = BaseBlobService(account_name=account_name, account_key=account_key)
token = service.generate_container_shared_access_signature(
    container_name,
    permission=ContainerPermissions.READ,
    expiry=datetime.utcnow() + timedelta(hours=1),
)

blob_names = service.list_blob_names(container_name)
blob_urls_with_token = (
    f"https://{account_name}.blob.core.windows.net/{container_name}/{blob_name}?{token}"
    for blob_name in blob_names
)
# print(list(blob_urls_with_token))
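Note that the code above uses the legacy azure-storage SDK (BaseBlobService). If your cluster has the newer azure-storage-blob (v12) package installed instead, the same listing-plus-SAS step would look roughly like the sketch below; the client and function names follow the v12 package, so verify them against the SDK version you actually have.

from datetime import datetime, timedelta
from azure.storage.blob import BlobServiceClient, ContainerSasPermissions, generate_container_sas

account_name = '<your account name>'
account_key = '<your account key>'
container_name = '<your container name>'

# Container-level SAS token, read/list only, valid for one hour.
token = generate_container_sas(
    account_name=account_name,
    container_name=container_name,
    account_key=account_key,
    permission=ContainerSasPermissions(read=True, list=True),
    expiry=datetime.utcnow() + timedelta(hours=1),
)

service = BlobServiceClient(
    account_url=f"https://{account_name}.blob.core.windows.net",
    credential=account_key,
)
container_client = service.get_container_client(container_name)

# Same generator of blob URLs with the SAS token appended.
blob_urls_with_token = (
    f"https://{account_name}.blob.core.windows.net/{container_name}/{blob.name}?{token}"
    for blob in container_client.list_blobs()
)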
Then, you can read these JSON files directly from the blobs via the read_json function to create their pandas DataFrames.

import pandas as pd

for blob_url_with_token in blob_urls_with_token:
    df = pd.read_json(blob_url_with_token)
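One assumption worth calling out: if the blobs happen to be newline-delimited JSON (one record per line) rather than plain JSON documents, read_json needs the lines=True flag, and the loop body above would become:

# Assumption: newline-delimited JSON, i.e. one JSON record per line in each blob.
# For ordinary JSON documents, drop lines=True.
df = pd.read_json(blob_url_with_token, lines=True)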
Even if you want to merge them into one big CSV file, you can first merge them into one big DataFrame via the pandas functions listed under Combining / joining / merging, such as append or concat. Writing a DataFrame to a CSV file is then easy with the to_csv function. Or you can convert a pandas DataFrame to a PySpark DataFrame on Azure Databricks, as in the code below.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()  # a SparkContext already exists on a Databricks cluster
sqlContext = SQLContext(sc)
spark_df = sqlContext.createDataFrame(df)
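Putting those steps together, a minimal sketch (assuming all blobs share the same columns, and that the blob_urls_with_token generator has not been consumed yet) could look like this; the /dbfs/tmp/merged.csv output path is just a placeholder:

import pandas as pd

# One DataFrame per JSON blob, then one big DataFrame.
dfs = [pd.read_json(url) for url in blob_urls_with_token]
merged_df = pd.concat(dfs, ignore_index=True)  # pandas.concat covers the append use case

# Write a single CSV; /dbfs/... is where DBFS is mounted on a Databricks driver.
merged_df.to_csv("/dbfs/tmp/merged.csv", index=False)

# In a Databricks notebook the entry point `spark` already exists,
# so the PySpark conversion can also simply be:
spark_df = spark.createDataFrame(merged_df)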
So from there, whatever you want to do next is simple. And if you want to schedule the script as a notebook in Azure Databricks, you can refer to the official document Jobs
to run Spark jobs.
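For reference, a rough sketch of creating such a scheduled notebook job through the Databricks Jobs API 2.1 is shown below; the workspace URL, token, notebook path, cluster settings and cron expression are all placeholders, and the exact payload should be checked against the official Jobs documentation.

import requests

workspace_url = "https://<your-workspace>.azuredatabricks.net"  # placeholder
api_token = "<personal access token>"                            # placeholder

job_spec = {
    "name": "merge-json-blobs",
    "tasks": [
        {
            "task_key": "merge_json",
            "notebook_task": {"notebook_path": "/Users/<you>/merge_json_notebook"},
            "new_cluster": {
                "spark_version": "<runtime version>",
                "node_type_id": "<node type>",
                "num_workers": 1,
            },
        }
    ],
    # Example schedule: every day at 02:00 UTC.
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
}

resp = requests.post(
    f"{workspace_url}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {api_token}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json())  # the response contains the new job_id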
Hope it helps.
Solution 2:
Copy the JSON files to storage (e.g. Blob Storage), and you can then access that storage from Databricks. There you can fix the files using Python and even transform them to the required format with the cluster running.
So, in the Copy Data activity, copy the files to Blob Storage if they are not there yet.
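As a minimal sketch of what that access could look like from a Databricks notebook (account name, key and container are placeholders; in practice the key would usually come from a secret scope), reading the copied JSON with Spark might be:

account_name = "<your account name>"
container_name = "<your container name>"

# Let Spark authenticate against the storage account with the account key.
spark.conf.set(
    f"fs.azure.account.key.{account_name}.blob.core.windows.net",
    "<your account key>",
)

# Read all JSON files in the container via the wasbs:// scheme.
df = spark.read.json(f"wasbs://{container_name}@{account_name}.blob.core.windows.net/")

# ...transform df as needed, then write it back in the required format, e.g.:
# df.write.mode("overwrite").csv(f"wasbs://{container_name}@{account_name}.blob.core.windows.net/output/")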