Data source: data/geo.csv
Code:
// Create the SparkConf
SparkConf conf = new SparkConf().setAppName("GeoSparkExample").setMaster("local[*]");
// Create the SparkSession
SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
// Register the SedonaSQL functions
SedonaSQLRegistrator.registerAll(spark);
Dataset<Row> rawDf = spark.read().format("csv").option("header", "true") // use the first row as column names
        .option("inferSchema", "true") // infer each column's data type
        .option("delimiter", ",") // column delimiter; the default is a comma
        // the polygon in the last row is not closed
        .load("data/geo.csv"); // file location
rawDf.createOrReplaceTempView("rawdf");
rawDf.show();
System.out.println("======================================");
spark.sql("DESCRIBE rawdf").show();
spark.sql("DESCRIBE rawdf polygon").show();
// Convert the string column to Sedona's Geometry type
Dataset<Row> result = spark.sql("SELECT ST_GeomFromWKT(regexp_replace(polygon, '\"', '')) AS geo FROM rawdf");
result.createOrReplaceTempView("test");
// Show the query result
result.show();
Error log:
23/07/21 15:57:02 ERROR FormatUtils: [Sedona] Points of LinearRing do not form a closed linestring
23/07/21 15:57:02 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 4)
java.lang.NullPointerException
at org.apache.sedona.sql.utils.GeometrySerializer$.getDimension(GeometrySerializer.scala:53)
at org.apache.sedona.sql.utils.GeometrySerializer$.serialize(GeometrySerializer.scala:37)
at org.apache.spark.sql.sedona_sql.expressions.implicits$GeometryEnhancer.toGenericArrayData(implicits.scala:79)
at org.apache.spark.sql.sedona_sql.expressions.ST_GeomFromWKT.eval(Constructors.scala:182)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:256)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:858)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:858)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Based on my experience, this exception is thrown because the ring of the original polygon is not closed. How can I use the ST_IsValid function to filter out such rows so the program does not fail?
Answer: ST_IsClosed(ST_GeomFromText(t.pg1))
https://sedona.apache.org/1.3.1-incubating/api/flink/Function/#st_isclosed
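To illustrate what the closedness check verifies, here is a minimal standalone sketch (hypothetical helper, not Sedona's implementation): a linear ring is closed exactly when its first and last coordinates coincide, which is what the `FormatUtils` error above is complaining about. The class and method names below are made up for this example, and the parsing assumes a simple single-ring `POLYGON` WKT string.

```java
// Hypothetical sketch: check whether the outer ring of a simple
// single-ring POLYGON WKT is closed (first coordinate == last coordinate).
// Not Sedona's code -- just an illustration of the rule that
// "Points of LinearRing do not form a closed linestring" enforces.
public class RingClosureCheck {

    static boolean isRingClosed(String wkt) {
        // Extract the coordinate list between the innermost parentheses.
        int open = wkt.lastIndexOf('(');
        int close = wkt.indexOf(')');
        String[] coords = wkt.substring(open + 1, close).split(",");
        String first = coords[0].trim();
        String last = coords[coords.length - 1].trim();
        // A closed ring repeats its first coordinate as its last one.
        return first.equals(last);
    }

    public static void main(String[] args) {
        String closedWkt = "POLYGON ((0 0, 1 0, 1 1, 0 0))";
        String openWkt = "POLYGON ((0 0, 1 0, 1 1, 0 1))"; // ring not closed
        System.out.println(isRingClosed(closedWkt)); // true
        System.out.println(isRingClosed(openWkt));   // false
    }
}
```

In the actual pipeline this check happens inside Sedona when the WKT is parsed, which is why filtering with a closedness/validity predicate in the WHERE clause (as in the answer above) is the way to keep bad rows from reaching `ST_GeomFromWKT`'s serializer.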