Spark reading from Oracle

17th June 2021

In an earlier blog, we saw how to read a Hive table in Spark. In this blog, we will see how to read data from an Oracle table over JDBC.

from pyspark.sql import SparkSession

appname = "Application name"
spark = SparkSession.builder.appName(appname).getOrCreate()

# Replace ORACLE_SERVER, PORT and SID with your server details.
oracleDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:@ORACLE_SERVER:PORT:SID") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "*****") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load()
print(oracleDF.count())
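
Spark's JDBC source also accepts a subquery in place of a plain table name, which lets Oracle filter the rows before they reach Spark. A minimal sketch; the table and column names here are hypothetical:

# "schema.orders" and "order_date" are hypothetical; substitute your own.
# The subquery must be wrapped in parentheses and given an alias.
query = "(SELECT order_id, amount FROM schema.orders WHERE order_date >= DATE '2021-01-01') t"

ordersDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:@ORACLE_SERVER:PORT:SID") \
    .option("dbtable", query) \
    .option("user", "username") \
    .option("password", "*****") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load()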

Either way, the data from the Oracle table is loaded into a DataFrame. After that, we can perform any transformation or action the program needs.
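
For instance, we can filter and aggregate the result like any other DataFrame. The column names below are hypothetical:

from pyspark.sql.functions import col

# Filter on a hypothetical "status" column and count rows per "region".
oracleDF.filter(col("status") == "ACTIVE") \
    .groupBy("region") \
    .count() \
    .show()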

We need to pass the required JDBC jar so that the Spark program can establish the connection with Oracle.

Shell script to submit the Python script

JDBC_JAR="/localpath/ojdbc6.jar"

# ${SPARK_SCRIPT} points at your PySpark file.
spark-submit --master yarn --deploy-mode client \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=30 \
  --conf spark.dynamicAllocation.initialExecutors=1 \
  --jars ${JDBC_JAR} \
  ${SPARK_SCRIPT}

Notice that we are passing ${JDBC_JAR} to the Spark application. This is the library that contains the Oracle JDBC driver used for the connection.
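
Alternatively, the same jar can be supplied from within the Python script through the spark.jars configuration when building the session. A sketch, assuming the jar sits at the same local path as above:

spark = SparkSession.builder \
    .appName(appname) \
    .config("spark.jars", "/localpath/ojdbc6.jar") \
    .getOrCreate()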

Also read,

  1. Spark reading from Hive table
  2. Spark execution modes