Upload of parquet or ORC file in AWS S3 bucket

Introduction

This is a continuation of the previous blog. In this blog, the Parquet, ORC or CSV file generated from JSON, as explained in the previous blog, is uploaded to an AWS S3 bucket. Even though Parquet and ORC files are binary, S3 provides a mechanism to view Parquet, CSV and text files.

What is AWS S3?

Amazon Simple Storage Service (Amazon S3) is a storage service that can be maintained and accessed over the Internet. Amazon S3 provides a web service that can be used to store and retrieve a virtually unlimited amount of data. The same can be done programmatically using the Amazon-provided APIs. To learn more about S3 and how to create a trial account, refer to the AWS documentation.

Prerequisite Setup:

  1. Project setup: Refer section 2.1 of previous blog.
  2. Maven dependencies
<dependency>
 <groupId>com.amazonaws</groupId>
 <artifactId>aws-java-sdk-s3</artifactId>
 <version>1.11.246</version>
</dependency>

S3 file upload class and method:

We need to add the above dependency in order to use the AWS S3 bucket. The code below also relies on Lombok (for @Slf4j), Spring (for @Component) and Apache Commons Text (for StringSubstitutor). To upload a file, we convert the Parquet, ORC or other format data to an InputStream object, so that the bytes are not corrupted, and then pass the stream, the file name (with its extension such as .csv, .txt or .orc) and the name of the target bucket to the putObject method of the AWS SDK, as shown in the code below.
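The conversion to an InputStream can be sketched with the JDK alone. This is a minimal illustration, assuming the converted file is already held in memory as a byte array. `UploadPayload`, `fromBytes` and `readAll` are hypothetical names introduced here, not part of the original application.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not from the original code): pairs the upload stream
// with the exact number of bytes it will deliver.
final class UploadPayload {
    final InputStream stream;
    final long contentLength;

    private UploadPayload(InputStream stream, long contentLength) {
        this.stream = stream;
        this.contentLength = contentLength;
    }

    // Wraps already-converted file bytes (Parquet, ORC, CSV, ...) unchanged.
    static UploadPayload fromBytes(byte[] fileBytes) {
        return new UploadPayload(new ByteArrayInputStream(fileBytes), fileBytes.length);
    }

    // Reads a stream back fully; used here only to check the bytes survive intact.
    static byte[] readAll(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] csv = "id,name\n1,alice\n".getBytes(StandardCharsets.UTF_8);
        UploadPayload payload = fromBytes(csv);
        System.out.println("bytes to upload: " + payload.contentLength);
    }
}
```

Knowing the length up front is useful because it can be passed to `ObjectMetadata.setContentLength` before the upload. When the length is not set for a stream upload, the SDK has to buffer the stream contents in memory to calculate it.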

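import java.io.InputStream;
import java.util.Map;

import com.amazonaws.AmazonClientException;
import com.amazonaws.AmazonServiceException;
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.AmazonS3URI;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;
import lombok.extern.slf4j.Slf4j;
import org.apache.commons.text.StringSubstitutor;
import org.springframework.stereotype.Component;

@Slf4j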
@Component
public class AwsClientImpl {

    public void uploadToS3(Map<String, String> chronoMap, String fileName,
            InputStream contentStream) {
        ObjectMetadata objectMetadata = new ObjectMetadata();
        objectMetadata.addUserMetadata("transmitMessages", "False");
        objectMetadata.setContentType("binary/octet-stream");
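        // Target location; {day} is replaced with the matching value from chronoMap.
        String s3Urinew = "s3://<bucket-name>/mytest/{day}";
        AmazonS3URI s3Uri = new AmazonS3URI(s3Urinew);
        // fileName is appended directly to the resolved key, so any "/" separator
        // must come from the template or from fileName itself.
        String objectKey = StringSubstitutor.replace(s3Uri.getKey(), chronoMap, "{", "}")
                .concat(fileName);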
        PutObjectRequest putObjectRequest = new PutObjectRequest(s3Uri.getBucket(), objectKey, contentStream,
                objectMetadata);
        AmazonS3 s3Client = this.buildAmazonS3Client();
        try {
            s3Client.putObject(putObjectRequest);
        } catch (AmazonServiceException ase) {

            String reason = String.format(
                "Encountered AmazonServiceException: \n"
                        + "HTTP Status Code: %d \n"
                        + "Error Message: %s \n"
                        + "AWS Error Code: %s \n"
                        + "Error Type: %s \n"
                        + "Request ID: %s \n",
                ase.getStatusCode(),
                ase.getMessage(),
                ase.getErrorCode(),
                ase.getErrorType().toString(),
                ase.getRequestId());
            log.error(reason, ase);
        } catch (AmazonClientException ace) {
            log.error("Encountered AmazonClientException: ", ace);
        }
    }

    public AmazonS3 buildAmazonS3Client() {
        // Placeholders: replace with your own credentials before running.
        String accessKeyId = "<accessId>";
        String secretAccessKey = "<secretAccesskey>";
        return AmazonS3ClientBuilder.standard().withRegion(Regions.EU_CENTRAL_1)
                .withCredentials(
                    new AWSStaticCredentialsProvider(new BasicAWSCredentials(
                            accessKeyId,
                            secretAccessKey)))
                .build();
    }
}
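
The object key in the code above is built by filling the `{day}` placeholder from `chronoMap` and appending the file name. The sketch below shows the same idea with plain JDK string replacement as a simplified stand-in for `StringSubstitutor`. It is not the library's implementation. `ObjectKeyBuilder`, `resolve` and `buildKey` are hypothetical names, and unlike the original code this version inserts a "/" between the resolved prefix and the file name when one is missing.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not from the original code): resolves {name}
// placeholders in a key template and appends the file name.
final class ObjectKeyBuilder {

    // Replaces each {name} with its value; placeholders without a value stay as-is.
    static String resolve(String template, Map<String, String> values) {
        String result = template;
        for (Map.Entry<String, String> entry : values.entrySet()) {
            result = result.replace("{" + entry.getKey() + "}", entry.getValue());
        }
        return result;
    }

    // Joins the resolved prefix and the file name with exactly one "/".
    static String buildKey(String keyTemplate, Map<String, String> values, String fileName) {
        String prefix = resolve(keyTemplate, values);
        return prefix.endsWith("/") ? prefix + fileName : prefix + "/" + fileName;
    }

    public static void main(String[] args) {
        Map<String, String> chronoMap = new HashMap<>();
        chronoMap.put("day", "2020-01-15"); // example value, assumed format
        System.out.println(buildKey("mytest/{day}", chronoMap, "data.parquet"));
    }
}
```

Grouping objects under a per-day prefix this way keeps each day's uploads together when browsing the bucket.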

As the above code snippet shows, we need a bucket in which the InputStream data of any file format can be stored. We need the access key ID, the secret access key and the bucket name in order to push the data via the putObject method provided by the AWS SDK. The buildAmazonS3Client method builds the client with the region and the credentials, which the SDK then uses to authenticate the request when the data is pushed to the S3 bucket.
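
The access key ID and secret access key are hard-coded placeholders in the snippet above. One possible alternative, sketched here with the JDK only, is to read them from the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, which are the standard names the AWS SDK itself recognises. `AwsCredentialSettings` and `fromEnvironment` are hypothetical names introduced for illustration.

```java
import java.util.Map;

// Hypothetical holder (not from the original code) for the two values that
// buildAmazonS3Client needs.
final class AwsCredentialSettings {
    final String accessKeyId;
    final String secretAccessKey;

    private AwsCredentialSettings(String accessKeyId, String secretAccessKey) {
        this.accessKeyId = accessKeyId;
        this.secretAccessKey = secretAccessKey;
    }

    // Looks up both values in the given environment map and fails fast if one is missing.
    static AwsCredentialSettings fromEnvironment(Map<String, String> env) {
        String id = env.get("AWS_ACCESS_KEY_ID");
        String secret = env.get("AWS_SECRET_ACCESS_KEY");
        if (id == null || id.isEmpty() || secret == null || secret.isEmpty()) {
            throw new IllegalStateException(
                "AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must both be set");
        }
        return new AwsCredentialSettings(id, secret);
    }

    public static void main(String[] args) {
        try {
            AwsCredentialSettings settings = fromEnvironment(System.getenv());
            // Print only the key ID length; never log the secret itself.
            System.out.println("access key ID length: " + settings.accessKeyId.length());
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The two fields could then be passed to `new BasicAWSCredentials(accessKeyId, secretAccessKey)` in place of the literal strings.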

Sample output of .parquet file in AWS S3 bucket

Once the data has been pushed successfully via the putObject method shown in the code above, we can view it in S3 using the built-in file viewing tools, as shown in the screenshot below.

As the screenshot shows, we can view the data of Parquet, CSV and text files.

 

Conclusion

This is the last blog of the series. In this blog, we uploaded the data converted from JSON to a .parquet, .csv or .orc file, held in an InputStream, to the specified AWS S3 bucket. The sample application code will be uploaded to GitHub.

Suggestions and questions are welcome!

 
