sparta.extract#

Functions#

read_with_schema(→ pyspark.sql.DataFrame)

Reads a DataFrame from a file using a predefined schema.

read_yaml_df(→ pyspark.sql.DataFrame)

Reads a YAML file and converts it into a Spark DataFrame.

Module Contents#

sparta.extract.read_with_schema(path: str, schema: str, options: Dict[Any, Any] = None, format: str = 'csv', spark: pyspark.sql.SparkSession = None) pyspark.sql.DataFrame#

Reads a DataFrame from a file using a predefined schema.

Parameters:
  • path (str) – Path where the file is located.

  • schema (str) – Pre-defined schema for reading.

  • options (dict, optional) – Configuration options passed to the reader. Defaults to None.

  • format (str, optional) – Format of the file to be read. Defaults to ‘csv’.

  • spark (SparkSession, optional) – Spark session. Defaults to None.

Returns:

DataFrame read with predefined schema.

Return type:

DataFrame

Example

>>> schema = 'epidemiological_week LONG, date DATE, order_for_place INT, state STRING, city STRING, city_ibge_code LONG, place_type STRING, last_available_confirmed INT'
>>> path = '/content/sample_data/covid19-e0534be4ad17411e81305aba2d9194d9.csv'
>>> df = read_with_schema(path, schema, {'header': 'true'}, 'csv')
sparta.extract.read_yaml_df(path: str, spark: pyspark.sql.SparkSession = None) pyspark.sql.DataFrame#

Reads a YAML file and converts it into a Spark DataFrame.

Parameters:
  • path (str) – The path of the YAML file to read.

  • spark (SparkSession, optional) – The Spark session to use for creating the DataFrame. If not provided, a new local Spark session is created by default.

Returns:

A Spark DataFrame containing the data from the YAML file.

Return type:

DataFrame

This function reads the YAML file at the specified path, converts it into a Python list, and then loads it into a Spark DataFrame. It also logs the conversion process. The function attempts to use CSafeLoader for faster YAML parsing but defaults to SafeLoader if unavailable.
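The loader fallback described above is a common PyYAML pattern. A minimal sketch, in which the helper name `load_yaml_records` is illustrative and not part of sparta.extract:

```python
import yaml

# Prefer the C-accelerated CSafeLoader (available when PyYAML is built
# against libyaml); fall back to the pure-Python SafeLoader otherwise.
try:
    from yaml import CSafeLoader as Loader
except ImportError:
    from yaml import SafeLoader as Loader


def load_yaml_records(path: str):
    """Parse a YAML file into a Python object (e.g. a list of dicts)."""
    with open(path) as fh:
        return yaml.load(fh, Loader=Loader)
```

A list of mappings loaded this way can then be handed to `SparkSession.createDataFrame` to produce the resulting DataFrame.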

Example

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName("YAML Reader").getOrCreate()
>>> df = read_yaml_df("data/sample.yaml", spark)
>>> df.show()

In this example, the function reads a YAML file located at “data/sample.yaml” and converts it into a Spark DataFrame using the provided Spark session.

Raises:
  • FileNotFoundError – If the YAML file at the specified path does not exist.

  • ValueError – If the YAML file contains invalid data that cannot be converted to a DataFrame.