Conversion of JSON Data to ORC and CSV Format Using the Apache Spark Library
1. Introduction
This is a continuation of the previous blog, which described the conversion of JSON data to Parquet format using Spark and its dependent libraries (see section 2.2 of that blog). In this blog we cover converting the same JSON data to CSV and ORC format. Apart from the Parquet conversion code shown in the previous blog, the CSV conversion needs only one additional built-in statement, and the same holds for the ORC conversion from JSON data.
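The conversion methods below operate on a List<Device>. The Device class itself is not shown in this excerpt, so the following is a minimal hypothetical sketch of such a POJO, just to make the snippets self-contained; the field names and types are assumptions, not the actual model from the previous blog:

// Hypothetical Device POJO; field names and types are assumptions.
// Gson serializes these fields via reflection, so no annotations are needed.
public class Device {
    private String deviceId;
    private String deviceType;
    private double temperature;

    public Device(String deviceId, String deviceType, double temperature) {
        this.deviceId = deviceId;
        this.deviceType = deviceType;
        this.temperature = temperature;
    }
}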
2. CSV format conversion approach
In this approach, the JSON input data is converted to CSV format. The code snapshot is shown below:
private File convertToCSV(List<Device> list) {
    JavaSparkContext sparkContext = null;
    File tempFile = null;
    try (SparkSession spark = SparkSession.builder()
            .master("local[4]")
            .appName("sample-application")
            .getOrCreate()) {
        tempFile = this.createTempFile();
        // Serialize the device list to a single JSON string
        Gson gson = new Gson();
        List<String> data = Arrays.asList(gson.toJson(list));
        sparkContext = JavaSparkContext.fromSparkContext(SparkContext.getOrCreate());
        // Let Spark infer the schema from the JSON input
        Dataset<String> stringDataSet = spark.createDataset(data, Encoders.STRING());
        Dataset<Row> csvDataSet = spark.read().json(stringDataSet);
        log.info("Inferred JSON schema and sample values:");
        csvDataSet.printSchema();
        csvDataSet.show();
        if (tempFile != null) {
            // Write GZIP-compressed CSV part files into the temp directory
            csvDataSet.write()
                .option("compression", "GZIP")
                .csv(tempFile.getPath());
            tempFile = this.retrieveCSVFileFromPath(tempFile);
        }
    } catch (Exception ex) {
        log.error("JSON to CSV conversion failed", ex);
    } finally {
        if (sparkContext != null) {
            sparkContext.close();
        }
    }
    return tempFile;
}
// Retrieve the converted CSV file from the temporary output directory.
// FILE_EXTENSION is assumed to be ".csv.gz", since Spark appends ".gz"
// to CSV part files when GZIP compression is enabled.
public static final String FILE_EXTENSION = ".csv.gz";

private File retrieveCSVFileFromPath(File tempFilePath) {
    List<File> files = Arrays.asList(tempFilePath.listFiles());
    return files.stream()
        .filter(tmpFile -> tmpFile.getPath().endsWith(FILE_EXTENSION))
        .findAny()
        .orElse(null);
}
As shown above, convertToCSV reads the JSON data and converts it to CSV, and retrieveCSVFileFromPath then retrieves the converted data file from the output path.
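As a quick usage sketch (assuming the hypothetical Device POJO above and that the surrounding class wires up createTempFile and an SLF4J logger), the method can be invoked like this:

List<Device> devices = Arrays.asList(
    new Device("dev-001", "thermostat", 21.5),
    new Device("dev-002", "thermostat", 23.0));
File csvFile = convertToCSV(devices);
if (csvFile != null) {
    log.info("Converted CSV written to {}", csvFile.getPath());
}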
3. Sample Data Input and Output
The sample input data can be the same as in section 4 of the previous blog. The output will be in comma-separated format without a header row.
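If a header row is desired, Spark's CSV writer supports a header option; a minimal sketch of the changed write call:

csvDataSet.write()
    .option("header", "true")        // write column names as the first row
    .option("compression", "GZIP")
    .csv(tempFile.getPath());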
4. JSON format to ORC format file conversion
Apache ORC (Optimized Row Columnar) is a free and open-source column-oriented data storage format from the Apache Hadoop ecosystem. It is similar to the other columnar storage file formats available in the Hadoop ecosystem, such as RCFile and Parquet, and it is compatible with most of the data processing frameworks in the Hadoop environment. For more about the ORC file format, refer to the Apache ORC project documentation.
Code snapshot for ORC file conversion: using the Spark library, the JSON object can be converted to ORC, which takes far less space, almost 75% less than a plain text file, so converting to ORC reduces storage cost. It also lets us store the data in compressed form while keeping it queryable, which is a great advantage of this format. The code snapshot for the ORC conversion follows.
public static final String FILE_EXTENSION_ORC = ".orc";

private File convertToORC(List<Device> list) {
    JavaSparkContext sparkContext = null;
    File tempFile = null;
    try (SparkSession spark = SparkSession.builder()
            .master("local[4]")
            .appName("sample-application")
            .getOrCreate()) {
        tempFile = this.createTempFile();
        // Serialize the device list to a single JSON string
        Gson gson = new Gson();
        List<String> data = Arrays.asList(gson.toJson(list));
        sparkContext = JavaSparkContext.fromSparkContext(SparkContext.getOrCreate());
        // Let Spark infer the schema from the JSON input
        Dataset<String> stringDataSet = spark.createDataset(data, Encoders.STRING());
        Dataset<Row> orcDataSet = spark.read().json(stringDataSet);
        log.info("Inferred JSON schema and sample values:");
        orcDataSet.printSchema();
        orcDataSet.show();
        if (tempFile != null) {
            // Write ORC part files into the temp directory
            orcDataSet.write().orc(tempFile.getPath());
            tempFile = this.retrieveORCFileFromPath(tempFile);
        }
    } catch (Exception ex) {
        log.error("JSON to ORC conversion failed", ex);
    } finally {
        if (sparkContext != null) {
            sparkContext.close();
        }
    }
    return tempFile;
}
// Retrieve the converted .orc file from the temporary output directory
private File retrieveORCFileFromPath(File tempFilePath) {
    List<File> files = Arrays.asList(tempFilePath.listFiles());
    return files.stream()
        .filter(tmpFile -> tmpFile.getPath().endsWith(FILE_EXTENSION_ORC))
        .findAny()
        .orElse(null);
}
As shown in the code, the Spark library takes care of the conversion from JSON to the .orc format; we then read the temp file path, which can be used to push or save the file to AWS S3.
The sample input data can be the same as in section 4 of the previous blog. The output will be in binary format.
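Since one advantage of ORC mentioned above is that the compressed data stays queryable, here is a minimal sketch that reads the ORC output back and queries it. It assumes an existing SparkSession named spark, that the File returned by convertToORC is named orcFile, and the field names of the hypothetical Device POJO above:

// Read the ORC output back into a DataFrame and run a SQL query on it
Dataset<Row> orcBack = spark.read().orc(orcFile.getPath());
orcBack.createOrReplaceTempView("devices");
spark.sql("SELECT deviceId, temperature FROM devices WHERE temperature > 22.0").show();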
5. Conclusion
In this blog, I have shown an approach, with sample code, to parse any input JSON object, even with nth-level nested structures, and convert it into CSV and ORC format. Using this approach, we can easily convert a JSON object to CSV or ORC. The only drawback is that the jar or war size increases drastically, since the Spark library is large and has many dependent libraries. In the next blog I will show how any format of file can be uploaded to an AWS S3 bucket.
A few of my other blogs:
- ODATA implementation for IOT Applications on SAP HCP
- Logging in SCP with Java & Tomee using slf4j, logback, Jolokia
- CDS extension feature in HANA
- Database migration in SCP using Liquibase
- Binary content upload & download from HANA DB via Apache Olingo OData in SCP
- Introduction to Annotations & Vocabularies
- Annotation’s usage in SAP S/4HANA Intelligent product design