In an earlier blog, we saw how to read a Hive table in Spark. In this blog, we will see how to read data from an Oracle table.
from pyspark.sql import SparkSession

appname = "Application name"
spark = SparkSession.builder.appName(appname).getOrCreate()

# Read the Oracle table over JDBC into a DataFrame
oracleDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:@ORACLE_SERVER:PORT/SID") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "*****") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load()

print(oracleDF.count())
This will load the data from the Oracle table into the DataFrame. After that, we can perform any operation the program needs.
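For example, here are a couple of common operations on the loaded DataFrame. The column names DEPARTMENT and EMPLOYEE_ID are hypothetical; substitute columns from your own table:

# Column names below are illustrative; use columns from your own table
filteredDF = oracleDF.filter(oracleDF["DEPARTMENT"] == "SALES") \
    .select("EMPLOYEE_ID", "DEPARTMENT")
filteredDF.show(10)

# Aggregate example: row count per department
oracleDF.groupBy("DEPARTMENT").count().show()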
We need to pass the required JDBC jar (the Oracle driver) so that the Spark program can establish the connection with Oracle.
Shell script to submit the Python script
# Path to the Oracle JDBC driver jar
JDBC_JAR="/localpath/ojdbc6.jar"
# Hypothetical path; point this at the PySpark script above
SPARK_SCRIPT="/localpath/read_oracle.py"

spark-submit --master yarn --deploy-mode client \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.dynamicAllocation.minExecutors=1 \
    --conf spark.dynamicAllocation.maxExecutors=30 \
    --conf spark.dynamicAllocation.initialExecutors=1 \
    --jars ${JDBC_JAR} \
    ${SPARK_SCRIPT}
If you notice, we are passing ${JDBC_JAR} to the Spark application through --jars. This is the library containing the Oracle JDBC driver used to establish the connection.
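For large tables, the read above goes through a single JDBC connection. Spark's JDBC source also accepts partitioning options (partitionColumn, lowerBound, upperBound, numPartitions) to read in parallel. A minimal sketch, assuming a numeric column EMPLOYEE_ID with values roughly between 1 and 1000000 (both the column and the bounds are assumptions; pick a column and range from your own table):

# Parallel JDBC read; partitionColumn and bounds are assumptions for illustration
oracleDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:@ORACLE_SERVER:PORT/SID") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "*****") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .option("partitionColumn", "EMPLOYEE_ID") \
    .option("lowerBound", "1") \
    .option("upperBound", "1000000") \
    .option("numPartitions", "10") \
    .load()

Each of the ten partitions then reads its own slice of the EMPLOYEE_ID range on a separate executor task.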