
Twofishes indexer does not handle MongoDB query failures properly. #53

Open
piranha32 opened this issue Mar 20, 2018 · 2 comments

@piranha32
While importing the data into the DB with "./src/jvm/io/fsq/twofishes/scripts/parse.py -w pwd/data/", the indexer crashed with the error message below. After the crash the indexer was stuck doing nothing, and no more records were processed. The data being processed was downloaded with "./src/jvm/io/fsq/twofishes/scripts/download-world.sh".

8246718 INFO  i.f.t.indexer.output.PrefixIndexer - done with 114000 of 2994876 prefixes
8247980 INFO  i.f.t.indexer.output.PrefixIndexer - done with 115000 of 2994876 prefixes
Exception in thread "main" io.fsq.rogue.RogueException: Mongo query on geocoder [db.name_index.find({ "name" : { "$regex" : "^?" , "$options" : ""} , "excludeFromPrefixIndex" : { "$ne" : true}}).sort({ "pop" : -1}).limit(1000)] failed after 2 ms
        at io.fsq.rogue.adapter.BlockingMongoClientAdapter.runCommand(BlockingMongoClientAdapter.scala:118)
        at io.fsq.rogue.adapter.BlockingMongoClientAdapter.runCommand(BlockingMongoClientAdapter.scala:52)
        at io.fsq.rogue.adapter.MongoClientAdapter.queryRunner(MongoClientAdapter.scala:471)
        at io.fsq.rogue.adapter.MongoClientAdapter.query(MongoClientAdapter.scala:496)
        at io.fsq.rogue.query.QueryExecutor.fetch(QueryExecutor.scala:144)
        at io.fsq.twofishes.indexer.output.PrefixIndexer.getRecordsByPrefix(PrefixIndexer.scala:76)
        at io.fsq.twofishes.indexer.output.PrefixIndexer$$anonfun$writeIndexImpl$2.apply(PrefixIndexer.scala:115)
        at io.fsq.twofishes.indexer.output.PrefixIndexer$$anonfun$writeIndexImpl$2.apply(PrefixIndexer.scala:110)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
        at io.fsq.twofishes.indexer.output.PrefixIndexer.writeIndexImpl(PrefixIndexer.scala:110)
        at io.fsq.twofishes.indexer.output.Indexer$$anonfun$writeIndex$1.apply$mcV$sp(Indexer.scala:54)
        at io.fsq.twofishes.indexer.output.Indexer$$anonfun$writeIndex$1.apply(Indexer.scala:54)
        at io.fsq.twofishes.indexer.output.Indexer$$anonfun$writeIndex$1.apply(Indexer.scala:54)
        at io.fsq.twofishes.util.DurationUtils$.inNanoseconds(DurationUtils.scala:16)
        at io.fsq.twofishes.util.DurationUtils$class.logDuration(DurationUtils.scala:23)
        at io.fsq.twofishes.indexer.output.Indexer.logDuration(Indexer.scala:37)
        at io.fsq.twofishes.indexer.output.Indexer.writeIndex(Indexer.scala:54)
        at io.fsq.twofishes.indexer.output.NameIndexer.writeIndexImpl(NameIndexer.scala:84)
        at io.fsq.twofishes.indexer.output.Indexer$$anonfun$writeIndex$1.apply$mcV$sp(Indexer.scala:54)
        at io.fsq.twofishes.indexer.output.Indexer$$anonfun$writeIndex$1.apply(Indexer.scala:54)
        at io.fsq.twofishes.indexer.output.Indexer$$anonfun$writeIndex$1.apply(Indexer.scala:54)
        at io.fsq.twofishes.util.DurationUtils$.inNanoseconds(DurationUtils.scala:16)
        at io.fsq.twofishes.util.DurationUtils$class.logDuration(DurationUtils.scala:23)
        at io.fsq.twofishes.indexer.output.Indexer.logDuration(Indexer.scala:37)
        at io.fsq.twofishes.indexer.output.Indexer.writeIndex(Indexer.scala:54)
        at io.fsq.twofishes.indexer.output.OutputIndexes.buildIndexes(OutputHFile.scala:28)
        at io.fsq.twofishes.indexer.importers.geonames.GeonamesParser$.writeIndexes(GeonamesParser.scala:144)
        at io.fsq.twofishes.indexer.importers.geonames.GeonamesParser$.main(GeonamesParser.scala:106)
        at io.fsq.twofishes.indexer.importers.geonames.GeonamesParser.main(GeonamesParser.scala)
Caused by: com.mongodb.MongoQueryException: Query failed with error code 2 and error message 'Regular expression is invalid: invalid UTF-8 string' on server 127.0.0.1:27017
        at com.mongodb.operation.FindOperation$1.call(FindOperation.java:521)
        at com.mongodb.operation.FindOperation$1.call(FindOperation.java:510)
        at com.mongodb.operation.OperationHelper.withConnectionSource(OperationHelper.java:431)
        at com.mongodb.operation.OperationHelper.withConnection(OperationHelper.java:404)
        at com.mongodb.operation.FindOperation.execute(FindOperation.java:510)
        at com.mongodb.operation.FindOperation.execute(FindOperation.java:81)
        at com.mongodb.Mongo.execute(Mongo.java:836)
        at com.mongodb.Mongo$2.execute(Mongo.java:823)
        at com.mongodb.OperationIterable.iterator(OperationIterable.java:47)
        at com.mongodb.OperationIterable.forEach(OperationIterable.java:70)
        at com.mongodb.FindIterableImpl.forEach(FindIterableImpl.java:166)
        at io.fsq.rogue.adapter.BlockingMongoClientAdapter.forEachProcessor(BlockingMongoClientAdapter.scala:162)
        at io.fsq.rogue.adapter.BlockingMongoClientAdapter.forEachProcessor(BlockingMongoClientAdapter.scala:52)
        at io.fsq.rogue.adapter.MongoClientAdapter$$anonfun$query$1.apply(MongoClientAdapter.scala:496)
        at io.fsq.rogue.adapter.BlockingMongoClientAdapter.findImpl(BlockingMongoClientAdapter.scala:272)
        at io.fsq.rogue.adapter.BlockingMongoClientAdapter.findImpl(BlockingMongoClientAdapter.scala:52)
        at io.fsq.rogue.adapter.MongoClientAdapter$$anonfun$queryRunner$1.apply(MongoClientAdapter.scala:472)
        at io.fsq.rogue.util.DefaultQueryLogger.onExecuteQuery(QueryLogger.scala:52)
        at io.fsq.rogue.adapter.BlockingMongoClientAdapter.runCommand(BlockingMongoClientAdapter.scala:113)
        ... 30 more
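
The underlying cause is visible in the failing query: the raw prefix is interpolated directly into the $regex, and the "^?" in the logged query suggests a character that could not be encoded, which mongod rejects as invalid UTF-8 (error code 2). A minimal sketch of a guard that would let the indexer detect such prefixes up front (isValidUtf8 is a hypothetical helper, not part of the Twofishes codebase):

    import java.nio.charset.StandardCharsets

    // Hypothetical helper: a String that cannot be encoded as valid UTF-8
    // (e.g. one containing an unpaired surrogate) does not survive an
    // encode/decode round trip, because the encoder substitutes a
    // replacement character for the bad code unit.
    def isValidUtf8(s: String): Boolean = {
      val bytes = s.getBytes(StandardCharsets.UTF_8)
      new String(bytes, StandardCharsets.UTF_8) == s
    }

Prefixes failing this check could be skipped (or logged) before the query is built, instead of crashing the whole index run.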

@piranha32
Author

Subsequent execution of parse.py with the same input data, without deleting the output directory, completed successfully, but the index appears to be corrupted: Twofishes starts and responds to queries without crashing, but throws exceptions on each query:

WARNING: Unexpected EOF reading file:data/prefix_index/index at entry #0.  Ignoring.
Mar 21, 2018 1:27:08 AM io.fsq.twofishes.server.HandleExceptions$$anonfun$apply$6 applyOrElse
SEVERE: got error: java.io.EOFException: reached EOF while trying to read 13 bytes
java.io.EOFException: reached EOF while trying to read 13 bytes
        at io.fsq.twofishes.core.MMapInputStreamImpl.read(MMapInputStream.scala:91)
        at io.fsq.twofishes.core.MMapInputStream.read(MMapInputStream.scala:160)
        at io.fsq.twofishes.core.MMapInputStream.read(MMapInputStream.scala:139)
        at java.io.DataInputStream.readFully(DataInputStream.java:195)
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:70)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:120)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2359)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2258)
        at io.fsq.twofishes.core.MapFileConcurrentReader.findPosition(MapFileConcurrentReader.java:245)
        at io.fsq.twofishes.core.MapFileConcurrentReader.get(MapFileConcurrentReader.java:279)
        at io.fsq.twofishes.server.MapFileInput$$anonfun$6.apply(HFileStorageService.scala:205)
        at io.fsq.twofishes.server.MapFileInput$$anonfun$6.apply(HFileStorageService.scala:203)
        at io.fsq.twofishes.util.DurationUtils$.inMilliseconds(DurationUtils.scala:10)
        at io.fsq.twofishes.server.MapFileInput.lookup(HFileStorageService.scala:203)
        at io.fsq.twofishes.server.PrefixIndexMapFileInput.get(HFileStorageService.scala:310)
        at io.fsq.twofishes.server.NameIndexHFileInput.getPrefix(HFileStorageService.scala:244)
        at io.fsq.twofishes.server.HFileStorageService.getIdsByNamePrefix(HFileStorageService.scala:54)
        at io.fsq.twofishes.server.HotfixableGeocodeStorageService.getIdsByNamePrefix(HotfixableGeocodeStorageService.scala:22)
        at io.fsq.twofishes.server.AutocompleteGeocoderImpl$$anonfun$9.apply(AutocompleteGeocoderImpl.scala:220)
        at io.fsq.twofishes.server.AutocompleteGeocoderImpl$$anonfun$9.apply(AutocompleteGeocoderImpl.scala:198)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.immutable.Range.foreach(Range.scala:160)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.AbstractTraversable.map(Traversable.scala:104)
        at io.fsq.twofishes.server.AutocompleteGeocoderImpl.generateAutoParsesHelper(AutocompleteGeocoderImpl.scala:198)
        at io.fsq.twofishes.server.AutocompleteGeocoderImpl.generateAutoParses(AutocompleteGeocoderImpl.scala:275)
        at io.fsq.twofishes.server.AutocompleteGeocoderImpl.doGeocodeImpl(AutocompleteGeocoderImpl.scala:294)
        at io.fsq.twofishes.server.AutocompleteGeocoderImpl.doGeocodeImpl(AutocompleteGeocoderImpl.scala:31)
        at io.fsq.twofishes.server.AbstractGeocoderImpl$$anonfun$doGeocode$1.apply(AbstractGeocoderImpl.scala:33)
        at com.twitter.util.Duration$.inMilliseconds(Duration.scala:183)
        at com.twitter.ostrich.stats.StatsProvider$class.time(StatsProvider.scala:196)
        at com.twitter.ostrich.stats.StatsCollection.time(StatsCollection.scala:31)
        at io.fsq.twofishes.server.AbstractGeocoderImpl$Stats$.time(AbstractGeocoderImpl.scala:43)
        at io.fsq.twofishes.server.AbstractGeocoderImpl.doGeocode(AbstractGeocoderImpl.scala:32)
        at io.fsq.twofishes.server.GeocodeRequestDispatcher.geocode(GeocodeRequestDispatcher.scala:28)
        at io.fsq.twofishes.server.GeocodeServerImpl$$anonfun$geocode$2.apply(GeocodeServer.scala:218)
        at io.fsq.twofishes.server.GeocodeServerImpl$$anonfun$geocode$2.apply(GeocodeServer.scala:218)
        at com.twitter.util.Try$.apply(Try.scala:13)
        at com.twitter.util.ExecutorServiceFuturePool$$anon$2.run(FuturePool.scala:115)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

@jglesner
Contributor

I'll preface this 'solution' by saying that I'm not a Scala programmer, so there may be a better way to handle this. However, I was able to temporarily get past this indexing error, and avoid the corrupted index, by editing fsqio/src/jvm/io/fsq/twofishes/indexer/output/PrefixIndexer.scala to add a try/catch block around the section of code where the error occurs (example below). In the current version of that file, this means enclosing lines 115 through 140.

    for {
      (prefix, index) <- sortedPrefixes.zipWithIndex
    } {
      if (index % 1000 == 0) {
        log.info("done with %d of %d prefixes".format(index, numPrefixes))
      }
      try {
        val records = getRecordsByPrefix(prefix, PrefixIndexer.MaxNamesToConsider)

        val (woeMatches, woeMismatches) = records.partition(r =>
          bestWoeTypes.contains(r.woeTypeOrThrow))

        val (prefSortedRecords, unprefSortedRecords) =
          sortRecordsByNames(woeMatches.toList)

        val fids = new HashSet[StoredFeatureId]
        //roundRobinByCountryCode(prefSortedRecords).foreach(f => {
        prefSortedRecords.foreach(f => {
          if (fids.size < PrefixIndexer.MaxFidsToStorePerPrefix) {
            fids.add(f.fidAsFeatureId)
          }
        })

        if (fids.size < PrefixIndexer.MaxFidsWithPreferredNamesBeforeConsideringNonPreferred) {
          //roundRobinByCountryCode(unprefSortedRecords).foreach(f => {
          unprefSortedRecords.foreach(f => {
            if (fids.size < PrefixIndexer.MaxFidsToStorePerPrefix) {
              fids.add(f.fidAsFeatureId)
            }
          })
        }

        prefixWriter.append(prefix, fidsToCanonicalFids(fids.toList))
      } catch {
        // Log which prefix failed instead of silently dropping it.
        case e: Exception =>
          println("Skipping prefix '%s' due to error: %s".format(prefix, e.getMessage))
      }
    }

If you already built the Mongo DB, you'll obviously want to throw that out and start again from the mongod --dbpath /local/directory/for/output/ step in the import instructions.
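
One caveat with catching every Exception is that it also masks unrelated failures, such as a dropped Mongo connection mid-build. An alternative sketch, reusing the hypothetical isValidUtf8 helper from above, is to filter the offending prefixes out before the loop and report how many were dropped:

    // Sketch: drop prefixes mongod would reject, and make the skipped
    // work visible in the logs rather than swallowing it per-iteration.
    val (validPrefixes, invalidPrefixes) = sortedPrefixes.partition(isValidUtf8)
    if (invalidPrefixes.nonEmpty) {
      log.info("dropping %d prefixes that are not valid UTF-8".format(invalidPrefixes.size))
    }

The try/catch can then remain as a last line of defense while the common failure mode is handled explicitly.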

jglesner pushed a commit to CorkHounds/fsqio that referenced this issue Jul 12, 2018
To address issue foursquare#53, I have wrapped prefix processing in a try/catch. I am not a Scala programmer, so this may not be the best implementation. However, if we do not contain regex errors during prefix processing, the processing fails and corrupts the database/indices. Therefore, I feel it's worthwhile to trap the error here and just skip over processing errors.