From XML
OSM XML files usually appear with the extension .osm. Since the data is all string-based, these files can be quite large compared to their PBF or ORC equivalents.
import org.apache.spark._
import scala.util.{Success, Failure}
import vectorpipe._
implicit val sc: SparkContext = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("xml-example")
)
val path: String = "/some/path/on/your/machine/foo.osm"
osm.fromLocalXML(path) match {
  case Failure(e) => { } /* Parsing failed somehow... is the filepath correct? */
  case Success((ns, ws, rs)) => { } /* (RDD[(Long, Node)], RDD[(Long, Way)], RDD[(Long, Relation)]) */
}
sc.stop()
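For a quick sanity check on the parsed data, the Success branch could be filled in as below. This is a minimal sketch; the println and count calls are illustrative and not part of VectorPipe's API.

osm.fromLocalXML(path) match {
  case Failure(e) => println(s"Could not parse ${path}: ${e.getMessage}")
  case Success((ns, ws, rs)) =>
    // Count the parsed OSM elements by type before doing any real work with them.
    println(s"Nodes: ${ns.count}, Ways: ${ws.count}, Relations: ${rs.count}")
}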
From PBF
For the time being, .osm.pbf files can be used by first converting them to .orc files with the osm2orc tool, and then following VectorPipe's ORC instructions given below.
From ORC
You must first add an extra dependency to the libraryDependencies list in your build.sbt:
"org.apache.spark" %% "spark-hive" % "2.2.0"
Then we can read our OSM data in parallel via Spark. Notice the use of SparkSession instead of SparkContext here:
import org.apache.spark.sql._
import scala.util.{Success, Failure}
import vectorpipe._
implicit val ss: SparkSession =
  SparkSession.builder.master("local[*]").appName("orc-example").enableHiveSupport.getOrCreate
val path: String = "s3://bucket/key/foo.orc"
osm.fromORC(path) match {
  case Failure(err) => { } /* Does the file exist? Do you have the right AWS credentials? */
  case Success((ns, ws, rs)) => { } /* (RDD[(Long, Node)], RDD[(Long, Way)], RDD[(Long, Relation)]) */
}
ss.stop()
This approach is particularly efficient when run on an EMR cluster, since EMR clusters have direct, high-bandwidth access to S3.
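Outside of EMR (say, when testing locally), the plain s3:// scheme usually isn't available; a common workaround, assuming hadoop-aws and the AWS SDK are on your classpath, is to read via s3a:// and pass credentials through the Hadoop configuration. The following is an illustrative sketch, not part of VectorPipe:

// Hypothetical local setup: point the s3a connector at your AWS credentials.
ss.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
ss.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

osm.fromORC("s3a://bucket/key/foo.orc") match {
  case Failure(err) => println(s"Could not read ORC: ${err.getMessage}")
  case Success((ns, ws, rs)) => println(s"Ways: ${ws.count}")
}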