Skip to Content
Technical Articles

Spring Boot app to covert Json to Parquet format using Apache spark library

1. Introduction

This is a series of blog where we will be describing about the spring Boot based application, which is an extension of the Spring framework that helps developers build simple and web-based applications quickly, with less code, by removing much of the boilerplate code and configuration that characterizes Spring. Here we can convert the json to a parquet format, Parquet is built to support very efficient compression and encoding schemes. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Parquet allows compression schemes to be specified on a per-column level and is future-proofed to allow adding more encoding as they are invented and implemented. A detailed description about parquet format in mentioned below link

https://parquet.apache.org/documentation/latest/

https://en.wikipedia.org/wiki/Apache_Parquet

Here are the series of blogs

2. Prerequisite Activities

2.1 Creating and build Spring-boot Application

A maven-based Spring-boot web projects can be created using the Spring Initializer, In the dependency section, select the Spring Web starter, Spring dev tools, Spring security etc. Here mandatory dependencies are Spring web starter to build restful API, dev tools for local development and others are optional. However, you can add the dependencies based on your requirements. A step by step approach to create web application is mentioned in the below links

A project snapshot looks like as shown below

2.2 Get Dependencies library:

Following library needs to be added along with the spring-boot dependencies added while creating the project using Spring Initializer as shown below

		<dependency>
			<groupId>org.springframework.boot</groupId>
			<artifactId>spring-boot-starter-web</artifactId>
		</dependency>
		<dependency>
			<groupId>org.springframework.boot</groupId>
			<artifactId>spring-boot-devtools</artifactId>
			<scope>runtime</scope>
			<optional>true</optional>
		</dependency>
		<dependency>
			<groupId>org.apache.derby</groupId>
			<artifactId>derby</artifactId>
			<scope>runtime</scope>
		</dependency>
		<dependency>
			<groupId>org.springframework.boot</groupId>
			<artifactId>spring-boot-starter-test</artifactId>
			<scope>test</scope>
		</dependency>
		<dependency>
			<groupId>org.apache.spark</groupId>
			<artifactId>spark-sql_2.12</artifactId>
			<version>2.4.3</version>
		</dependency>
		<dependency>
			<groupId>org.apache.spark</groupId>
			<artifactId>spark-mllib_2.12</artifactId>
			<version>2.4.3</version>
			<scope>runtime</scope>
		</dependency>

In the above dependency, spring boot as well as Apache spark library are mentioned.

2.3 configuration of Lombok dependencies:

Project Lombok is a Java library tool that is used to minimize boilerplate code and save time during development. It is no doubt that Java is a great language, but recently, it is criticized by the community for one important reason — verbosity. It generates code, for example, getters, setters, and toString, and the IDE does the same thing for us only it generates in our source code while Lombok generates it in the “.class” file directly.  It is one of the helpful tools for the developers as it speed up the development. To download the library, refer link. However, to configure and understand more about the Lombok library refer the below link

The lombok dependency is mentioned below

<dependency>
 <groupId>org.projectlombok</groupId>
 <artifactId>lombok</artifactId>
</dependency>

2.4 POJO code snapshot as shown below using the lombok api

  1. Device, 2. DeviceList
@Data
public class Device {
    private String id;   
    private String alternateId;   
    private Double temperature;   
    private Double pressure;   
    private String channelId;   
}

@Data : It generates getters for all fields, a useful toString method, and hashCode and equals implementations that check all non-transient fields. Will also generate setters for all non-final fields, as well as a constructor.

@Data
public class DeviceList {
    private List<Device> deviceLst;  
}

 

3. Configuration and Implementations

3.1 Create a Rest controller class mapping of get and post method:

@Slf4j
@RequestMapping("/toparquet")
@RestController
public class JsonToParquetConverter {
    public static final String FILE_EXTENSION = ".parquet";
//    private AwsClientImpl awsClientImpl;
    public JsonToParquetConverter(AwsClientImpl awsClientImpl) {
        this.awsClientImpl = awsClientImpl;
    }
   @PostMapping(value = "/convert")
    @ResponseStatus(HttpStatus.OK)
    public String convertJsonToParquet(@RequestBody @Validated DeviceList deviceList) {
        log.info("******************Inside the convertJsonToParquet****************");
        Map<String, String> chronoMap = getFolderBasedOnTimestamp(Instant.now().toEpochMilli());
        File parquetFile = null;
        try {
            parquetFile = convertToParquet(deviceList.getDeviceLst());
            if (parquetFile != null && parquetFile.getPath() != null) {
                final InputStream parquetStream = new DataInputStream(
                        new FileInputStream(parquetFile.getAbsoluteFile()));
                String fileName =  System.currentTimeMillis() * 1000 + FILE_EXTENSION;
               // to store the converted data to AWS S3
               // this.awsClientImpl.uploadToS3(chronoMap, fileName, parquetStream);
            }
        } catch (Exception e) {
            log.error("exception {}", e);
            e.printStackTrace();
        }
        return " Covert from Json to Parquet File Sucessful !!!";
    }

As we know that in Spring-boot based application is an annotation driven apps, so we just need to annotate the class as shown below like

@RequestMapping(“/toparquet”)  :  Using this annotation for mapping web requests onto methods in request-handling classes with flexible method signatures.

@RestController: Using this annotation to define the class as rest end points class. Similarly, other annotation like GetMapping, PostMapping, RequestStatus etc.

@Slf4j  it helps lombok to generate a logger field and easily we can log all the logs.

3.2 Parquet conversion method:

Before going to parquet conversion from json object, let us understand the parquet file format. Apache Parquet is a self-describing data format which embeds the schema, or structure, within the data itself. It is binary data in a column-oriented way, where the values of each column are organized so that they are all adjacent, enabling better compression. It is especially good for queries which read particular columns from a “wide” (with many columns) table since only needed columns are read and IO is minimized. When we are processing Big data, cost required to store such data is more (Hadoop stores data redundantly i.e. 3 copies of each file to achieve fault tolerance) along with the storage cost processing the data comes with CPU, Network IO, etc. costs. As the data increases cost for processing and storage increases. Parquet is the choice of Big data as it serves both needs, efficient and performance in both storage and processing. This results in a file that is optimized for query performance and minimizing I/O. Specifically, it has the following characteristics:

  • Apache Parquet is column-oriented and designed to bring efficient columnar storage of data compared to row-based like CSV
  • Apache Parquet is built from the ground up with complex nested data structures in mind
  • Apache Parquet is built to support very efficient compression and encoding schemes (seeGoogle Snappy)
  • Apache Parquet allows to lower storage costs for data files and maximizes the effectiveness of querying data with serverless technologies like Amazon Athena, Redshift Spectrum, BigQuery, and Azure Data Lakes.
  • Licensed under the Apache software foundation and available to any project.

The Parquet “big data” association may give an impression that the format is limited to specific use cases. As Parquet has moved out of the shadow of complex Hadoop big data solutions

To know more about the parquet file format, refer the below link

3.3 Spark Library usage to convert it to parquet File format:

Here we are using the spark library to  convert the json data to parquet format, the main advantage of using the library is that provide any form of complex json format, it will convert it to parquet, however there are other library which do the same thing like avro-parquet library but in that case, if the json structure is generic or if it nested to more than 3 level, then in that case it will not be able to convert it. In that case we need to create the parquet schema by reading the 1st set of records before converting it to parquet format(which I will show in my future blog), here in spark library approach also it first read the  schema, Because in any case if we need to convert the json or text file to any other format to parquet format, But here we no need to read the records beforehand it just scan the input data and library it self creates the schema out of it, then it covert the input data to parquet format.

3.4 Method to convert json to parquet File format:

The following method needs is using the JavaSparkContext, SparkSession object to create session and read the schema and convert the data to parquet format. It first writes it to temporary files and then then the parquet object can be stored or upload it into AWS S3 bucket. Which will be explained in the next part of the blog.

    /**
     * @param list
     * @return temp file
     *@Conver the input json data to parquet format
     */
    private File convertToParquet(List<Device> list) {
        JavaSparkContext sparkContext = null;
        File tempFile = null;
        try (SparkSession spark = SparkSession.builder()
                .master("local[4]")
                .appName("ConvertorApp")
                .getOrCreate()) {
            tempFile = this.createTempFile();
            Gson gson = new Gson();
            List<String> data = Arrays.asList(gson.toJson(list));
            sparkContext = JavaSparkContext.fromSparkContext(SparkContext.getOrCreate());
            Dataset<String> stringDataSet = spark.createDataset(data, Encoders.STRING());
            Dataset<Row> parquetDataSet = spark.read().json(stringDataSet);
            log.info("Inserted json conversion schema and value");
            parquetDataSet.printSchema();
            parquetDataSet.show();
            if (tempFile != null) {
                parquetDataSet.write().parquet(tempFile.getPath());
                tempFile = this.retrieveParquetFileFromPath(tempFile);
            }
        } catch (Exception ex) {
            log.error("Stack Trace: {}", ex);
        } finally {
            if (sparkContext != null) {
                sparkContext.close();
            }
        }
        return tempFile;
    }
//Create the temp file path to copy converted parquet data
    private File createTempFile() throws IOException {
        Path tempPath = Files.createTempDirectory("");
        File tempFile = tempPath.toFile();
        if (tempFile != null && tempFile.exists()) {
            String tempFilePath = tempFile.getAbsolutePath();
            tempFile.deleteOnExit();
            Files.deleteIfExists(tempFile.toPath());
            log.debug("Deleted tempFile[ {} ]}", tempFilePath);
        }
        return tempFile;
    }
//Retrieve the parquet file path
    private File retrieveParquetFileFromPath(File tempFilePath) {
        List<File> files = Arrays.asList(tempFilePath.listFiles());
        return files.stream()
                .filter(
                    tmpFile -> tmpFile.getPath().contains(FILE_EXTENSION) && tmpFile.getPath().endsWith(FILE_EXTENSION))
                .findAny()
                .orElse(null);
    }

In the above code snippet convertToParquet() method to convert json data to parquet format data using spark library. createTempFile() method used to create a temp file in the jvm to temporary store the parquet converted data before pushing/storing it to AWS S3.

4. Testing the Rest Services

Publish the application in cloud foundry or run locally in Eclipse IDE using spring boot App configurations

As we can see that above the mentioned payload only two level of nested json object. The main advantage of using the spark library is we can pass any payload it will convert it to columnar storage of parquet. But the main disadvantage of spark library, it makes the application jar fat, by almost 120 MB. So, if we are going to use this approach then needs be prepared that application jar size would be huge.

Miscellaneous hacks: As you must have notices above in the parquet conversion code.

parquetDataSet.printSchema(): Prints the Schema of the input data. In my case sample of the schema is shown below

parquetDataSet.show(): It will print the data after conversion to parquet format as show below

Conclusion

The main intention of this blog is to show how we can convert the json data to parquet format data using the Apache spark library in real time. Using this approach we can easily reduce the storage cost in cloud, As we know that in cloud we are charged based on the volume of data stored. While implementing in my project I searched a lot but didn’t get proper example of end-to end to implementation in java. In the next blog the conversion of json to CSV, json to ORC (Optimized Row Columnar) file format.

 

 

Few of my other blogs:-

Be the first to leave a comment
You must be Logged on to comment or reply to a post.