GLUE

What is AWS Glue?

AWS Glue is a fully managed, pay-as-you-go extract, transform, and load (ETL) service that automates the time-consuming steps of preparing data for analytics. Glue can also connect to MySQL, Oracle, Microsoft SQL Server, and PostgreSQL databases that run on Amazon Elastic Compute Cloud (EC2) instances in an Amazon Virtual Private Cloud (VPC).


  • AWS Glue is a fully managed ETL service. This service makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it swiftly and reliably between various data stores.
  • It comprises a central metadata repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries.
  • AWS Glue is serverless, which means there’s no infrastructure to set up or manage.

AWS Glue Concepts
You define jobs in AWS Glue to accomplish the work that’s required to extract, transform, and load (ETL) data from a data source to a data target. You typically perform the following actions:
  • First, you define a crawler to populate your AWS Glue Data Catalog with metadata table definitions. You point your crawler at a data store, and the crawler creates table definitions in the Data Catalog.
  • In addition to table definitions, the Data Catalog contains other metadata that is required to define ETL jobs. You use this metadata when you define a job to transform your data.
  • AWS Glue can generate a script to transform your data, or you can provide your own script in the AWS Glue console or API.
  • You can run your job on-demand, or you can set it up to start when a specified trigger occurs. The trigger can be a time-based schedule or an event.
  • When your job runs, a script extracts data from your data source, transforms the data, and loads it to your data target. This script runs in an Apache Spark environment in AWS Glue.
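
Putting these steps together, a generated script has the extract-transform-load shape. Here is a minimal PySpark sketch of such a job; the database, table, column mappings, and output path are illustrative assumptions, not values from this article:

import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Resolve the job name that the Glue job system passes in
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Extract: read a table that a crawler registered in the Data Catalog
source = glueContext.create_dynamic_frame.from_catalog(
    database="analytics_db",   # assumed database name
    table_name="raw_events")   # assumed table name

# Transform: rename/retype columns (mappings are illustrative)
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("event_ts", "string", "event_ts", "timestamp"),
              ("user_id", "string", "user_id", "string")])

# Load: write the result to the data target
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/clean/"},  # assumed path
    format="parquet")

job.commit()
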
AWS Glue Terminology

Data Catalog: The persistent metadata store in AWS Glue. It contains table definitions, job definitions, and other control information to manage your AWS Glue environment.

Classifier: Determines the schema of your data. AWS Glue provides classifiers for common file types, such as CSV, JSON, Avro, XML, and others.

Connection: Contains the properties that are required to connect to your data store.

Crawler: A program that connects to a data store (source or target), progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in the Data Catalog.

Database: A set of associated Data Catalog table definitions organized into a logical group in AWS Glue.

Data Store, Data Source, Data Target: A data store is a repository for persistently storing your data. A data source is a data store that is used as input to a process or transform. A data target is a data store that a process or transform writes to.

Development Endpoint: An environment that you can use to develop and test your AWS Glue ETL scripts.

Job: The business logic that is required to perform ETL work. It is composed of a transformation script, data sources, and data targets.

Notebook Server: A web-based environment that you can use to run your PySpark statements. PySpark is a Python dialect for ETL programming.

Script: Code that extracts data from sources, transforms it, and loads it into targets. AWS Glue generates PySpark or Scala scripts.

Table: The metadata definition that represents your data. A table defines the schema of your data.

Transform: The code logic that you use to manipulate your data into a different format.

Trigger: Initiates an ETL job. You can define triggers based on a scheduled time or an event.


When should I use AWS Glue for streaming?

AWS Glue is recommended for streaming when your use cases are primarily ETL and you want to run jobs on a serverless, Apache Spark-based platform.

How do I launch the Spark history server?

You can launch the Spark history server using an AWS CloudFormation template that hosts the server on an EC2 instance, or launch it locally using Docker.

How does a Glue crawler determine when to create partitions?

When an AWS Glue crawler scans an Amazon S3 path and detects multiple folders in a bucket, it determines the root of a table in the folder structure and which folders are partitions of the table. The name of the table is based on the Amazon S3 prefix or folder name. You provide an Include path that points to the folder level to crawl. When the majority of schemas at a folder level are similar, the crawler creates partitions of a table instead of separate tables.
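
For example, given a hypothetical bucket layout such as:

s3://my-bucket/sales/year=2023/month=01/part-0001.json
s3://my-bucket/sales/year=2023/month=02/part-0002.json
s3://my-bucket/sales/year=2024/month=01/part-0003.json

with the Include path s3://my-bucket/sales/ and similar schemas in every folder, the crawler would create a single sales table partitioned by year and month rather than one table per folder.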

When do I use a Glue classifier?

You use classifiers when you crawl a data store to define metadata tables in the AWS Glue Data Catalog. You can set up your crawler with an ordered set of classifiers. When the crawler invokes a classifier, the classifier determines whether the data is recognized. If the classifier can’t recognize the data or is not 100 percent certain, the crawler invokes the next classifier in the list to determine whether it can recognize the data.
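
If the built-in classifiers don’t recognize your format, you can register a custom classifier and place it at the front of the crawler’s list. A minimal boto3 sketch of a custom grok classifier (the name and pattern are illustrative):

import boto3

glue = boto3.client("glue")
# Register a custom grok classifier that a crawler can try before
# falling back to the built-in classifiers
glue.create_classifier(
    GrokClassifier={
        "Name": "app-log-classifier",
        "Classification": "app_logs",
        "GrokPattern": "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:message}",
    }
)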

How do I import metadata from my existing Apache Hive Metastore to the AWS Glue Data Catalog?

Run an ETL job that reads from your Apache Hive Metastore, exports the data to an intermediate format in Amazon S3, and then imports that data into the AWS Glue Data Catalog.

What are time-based schedules for jobs and crawlers?

You can define a time-based schedule for your crawlers and jobs in AWS Glue. You specify the time in Coordinated Universal Time (UTC), and the minimum precision for a schedule is 5 minutes.
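
Schedules are expressed as cron expressions evaluated in UTC. A minimal boto3 sketch that attaches a nightly schedule to a job (the trigger and job names are assumed):

import boto3

glue = boto3.client("glue")
glue.create_trigger(
    Name="nightly-etl-trigger",           # assumed trigger name
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",         # every day at 02:00 UTC
    Actions=[{"JobName": "my-etl-job"}],  # assumed job name
    StartOnCreation=True,
)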


What happens when a crawler runs?

When a crawler runs, it takes the following actions to interrogate a data store:

Classifies data to determine the format, schema, and associated properties of the raw data – You can configure the results of classification by creating a custom classifier.

Groups data into tables or partitions – Data is grouped based on crawler heuristics.

Writes metadata to the Data Catalog – You can configure how the crawler adds, updates, and deletes tables and partitions.
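
A minimal boto3 sketch that creates such a crawler over an S3 path and controls how it adds, updates, and deletes Data Catalog entries (the names, role, and path are assumptions):

import boto3

glue = boto3.client("glue")
glue.create_crawler(
    Name="s3-sales-crawler",
    Role="GlueCrawlerRole",  # assumed IAM role with S3 and Glue access
    DatabaseName="analytics_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/sales/"}]},
    SchemaChangePolicy={"UpdateBehavior": "UPDATE_IN_DATABASE",
                        "DeleteBehavior": "LOG"},
)
glue.start_crawler(Name="s3-sales-crawler")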

What are development endpoints?

The Development Endpoints API covers the AWS Glue operations for testing with a custom DevEndpoint. A development endpoint is an environment where a developer can remotely develop and debug extract, transform, and load (ETL) scripts.


In Glue, is it possible to trigger an AWS Glue crawler when new files are uploaded to an S3 bucket, given that the crawler is “pointed” at that bucket?

No, there is currently no direct way to invoke an AWS Glue crawler in response to an upload to an S3 bucket; a workaround is sketched after the list below. S3 event notifications can be sent only to:

SNS

SQS

Lambda
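
A common workaround is therefore to point the S3 event notification at a Lambda function that starts the crawler. A minimal handler sketch (the crawler name is assumed):

import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # The S3 notification payload arrives in `event`; here we simply
    # kick off the crawler that covers the bucket
    try:
        glue.start_crawler(Name="s3-sales-crawler")
    except glue.exceptions.CrawlerRunningException:
        pass  # the crawler is already running; nothing to do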

Which Data Stores Can I Crawl using Glue?

Crawlers can crawl both file-based and table-based data stores.

Crawlers can crawl the following data stores through their respective native interfaces:

  • Amazon Simple Storage Service (Amazon S3)
  • Amazon DynamoDB

Crawlers can crawl the following data stores through a JDBC connection:

  • Amazon Redshift
  • Amazon Relational Database Service (Amazon RDS): Amazon Aurora, Microsoft SQL Server, MySQL, Oracle, PostgreSQL
  • Publicly accessible databases: Aurora, Microsoft SQL Server, MySQL, Oracle, PostgreSQL

What are AWS tags in AWS Glue?

A tag is a label that you assign to an AWS resource. Each tag consists of a key and an optional value, both of which you define. You can use tags in AWS Glue to organize and identify your resources. Tags can be used to create cost accounting reports and restrict access to resources.
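
A minimal boto3 sketch that tags a Glue job (the ARN and tag values are illustrative):

import boto3

glue = boto3.client("glue")
glue.tag_resource(
    ResourceArn="arn:aws:glue:us-east-1:123456789012:job/my-etl-job",
    TagsToAdd={"team": "analytics", "env": "prod"},
)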

What are AWS Glue metrics?

When you interact with AWS Glue, it sends metrics to CloudWatch. You can view these metrics using the AWS Glue console (the preferred method), the CloudWatch console dashboard, or the AWS Command Line Interface (AWS CLI).
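
You can also pull the metrics programmatically. A minimal boto3 sketch, assuming the metrics live under CloudWatch’s "Glue" namespace:

import boto3

cloudwatch = boto3.client("cloudwatch")
response = cloudwatch.list_metrics(Namespace="Glue")
for metric in response["Metrics"]:
    print(metric["MetricName"])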

Is it possible to repartition the data using an AWS Glue crawler?

You can’t do it with a crawler; however, you can create the new table manually in Athena.
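
Alternatively, a Glue ETL job (rather than a crawler) can rewrite the data with new partition keys. A minimal sketch, with the database, table, partition columns, and output path all assumed:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext())
# Read the existing table from the Data Catalog
frame = glueContext.create_dynamic_frame.from_catalog(
    database="analytics_db", table_name="sales")
# Write it back out partitioned by the chosen columns
glueContext.write_dynamic_frame.from_options(
    frame=frame,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/sales_repartitioned/",
                        "partitionKeys": ["year", "month"]},
    format="parquet",
)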

Can we use the Apache Spark web UI to monitor and debug AWS Glue ETL jobs?

Yes, you can use the Apache Spark web UI to monitor and debug AWS Glue ETL jobs running on the AWS Glue job system, and also Spark applications running on AWS Glue development endpoints. The Spark UI enables you to check the following for each job:

The event timeline of each Spark stage

A directed acyclic graph (DAG) of the job

Physical and logical plans for SparkSQL queries

The underlying Spark environmental variables for each job

What are the main components of AWS Glue?

AWS Glue consists of the Data Catalog, a central metadata repository; an ETL engine that can automatically generate Scala or Python code; and a flexible scheduler that handles dependency resolution, job monitoring, and retries.

How do I process MS Excel files using Glue?

As of now, Glue crawlers don’t support MS Excel files. If you want to create a table for an Excel file, you have to convert it from Excel to CSV, JSON, or Parquet first and then run a crawler on the newly created file.
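
One common conversion approach, sketched here with pandas (the file names are placeholders, and reading .xlsx requires the openpyxl package):

import pandas as pd

# Read the workbook and write a crawler-friendly CSV
df = pd.read_excel("report.xlsx")
df.to_csv("report.csv", index=False)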

Explain the AWS Glue Data Catalog.

The AWS Glue Data Catalog is a central repository that stores structural and operational metadata for all your data assets. For a given data set, you can store its table definition and physical location, add business-relevant attributes, and track how the data has changed over time.

What are AWS Glue triggers?

When fired, a trigger can start specified jobs and crawlers. A trigger fires on demand, based on a schedule, or based on a combination of events. A trigger can be in one of several states: CREATED, ACTIVATED, or DEACTIVATED, plus transitional states such as ACTIVATING. To temporarily stop a trigger from firing, you can deactivate it and then reactivate it later.
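
A minimal boto3 sketch of deactivating and reactivating a trigger (the trigger name is assumed):

import boto3

glue = boto3.client("glue")
glue.stop_trigger(Name="nightly-etl-trigger")   # deactivate temporarily
glue.start_trigger(Name="nightly-etl-trigger")  # reactivate later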

Give some argument names that AWS Glue uses internally and that you can’t set.

--conf

--debug

--mode

--JOB_NAME

How does AWS Glue monitor dependencies?

AWS Glue manages dependencies between two or more jobs or dependencies on external events using triggers. Triggers can watch one or more jobs as well as invoke one or more jobs.
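
A minimal boto3 sketch of a conditional trigger that starts one job only after another succeeds (both job names are assumed):

import boto3

glue = boto3.client("glue")
glue.create_trigger(
    Name="run-load-after-clean",
    Type="CONDITIONAL",
    Predicate={"Conditions": [{
        "LogicalOperator": "EQUALS",
        "JobName": "clean-job",      # watched job
        "State": "SUCCEEDED",
    }]},
    Actions=[{"JobName": "load-job"}],  # job to invoke
    StartOnCreation=True,
)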

How do I get metadata into the AWS Glue Data Catalog?

Glue crawlers scan the data stores you own to automatically infer schemas and partition structure, and they populate the Glue Data Catalog with the corresponding table definitions and statistics.

What are bookmarks in AWS Glue?

AWS Glue tracks data that has already been processed during a previous run of an ETL job by persisting state information from the job run. This persisted state information is called a job bookmark. Job bookmarks help AWS Glue maintain state information and prevent the reprocessing of old data.
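
Bookmarks are enabled per job run through the --job-bookmark-option argument. A minimal boto3 sketch (the job name is assumed):

import boto3

glue = boto3.client("glue")
glue.start_job_run(
    JobName="my-etl-job",
    Arguments={"--job-bookmark-option": "job-bookmark-enable"},
)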
