Executor error during find command caused by operation exceeded time limit - Исправление ошибок и поиск оптимальных решений проблем

Необычно короткий пост-заметка о наболевшем.

— «я не DBA, я ваши базы не знаю» — хочется мне порой сказать. Как и многим из вас наверное. Однако все таки это не правильное мнение и сейчас я расскажу вам свежую историю о том, как знание баз данных (даже на оч простом уровне) помогает в работе.

Какое-то время назад я писал заметки по mongodb. И вот они мне пригодились — а конкретно про explain запроса)

Короче, история такая — принимаем мы значит у разработки релиз на выкатку в prod. Все вроде хорошо, но вот не задача — на stage падает один автотест (по логам нам прилетает 500-ая ошибка)… как раз на свежую фичу бекенда, которая должна была в этом релизе выкатиться (собственно ради нее все и затевалось).

Ребята, которые у нас занимаются доставкой потыкались — нашли только в логах, что во время этого теста, в ответ на полученный по api запрос, сервер в лог помимо прочего пишет ошибку MongodbNetworkError. И все (из полезного).
Разработчики (конкретная команда ответственная за фичу) — завели привычную песенку «у меня все работает», только звучало это как — «у нас на регрессионном стенде тест проходит это у вас на стейдже с базой какие то проблемы. Не мешайте нам выкатываться- стейдж чините».

Ну что… Пришлось вспомнить что я еще играющий тимлид, расчехлить консольку и встать на защиту чести стейджа)
А дальше — расследование.

Первое что я сделал- собрал все запросы,которые выполнял тест. Посмотрел в логах (на самом сервере, не в elk) — время когда запрос пришел и когда мы на него ответили 500 (ошибка с монгой скрывается внутри, речь идет именно про request-reply с клиентом) — и бац, все запросики у нас выходили по 30 секунд. Ровно! Подозрительно похоже на таймаут.

Ок, расковыриваю что за запрос делается (у нас приложение логирует какой query он кидает в базу) и иду выполняю его вручную через mongoshell — ага, повисает… Добавляю к запросу параметр .maxTimeMS(30000) — получаю ошибку запросы что он не укладывается в 30 сек:

«errmsg» : «Executor error during find command :: caused by :: errmsg: «operation exceeded time limit»»

ну это 100 из 100 чт где то кто то с базой продолбался… Иду на регрессию и выполняю похожий запрос (данные запроса отличаются, характер тот же) — отрабатывает.

Начинаю сравнивать — под эту фичу у нас в документ определенной коллекции добавлялись несколько полей и для быстрого поиска по ним строился индекс. На проде ради этого делали отдельное долгое изменение по всем базам (ну прикиньте индекс по 4тб построить).

Результат изысканий:

поля есть и на регресии и в стейдже
индекс построен и там и там
индекс одинаковый (по структуре)
данные в индексах есть (то есть они сами не пустые).

Сижу чешу репу — что не так… И тут вспоминаю про благословенный exlain…
добавляю к запросу .explain() и что я вижу:

...

«winningPlan» : {

«stage» : «COLLSCAN»

...

То есть эта сволочь не пытается работать с индексом а идет и просто перебором ищет по всей коллекции. Конечно мы так в 30 сек таймаута не уложимся — на стейдже коллекции большие. Не как в проде но все равно — десятки гигабай.

Ок, думаю — как оно тогда в регресии то работает? Индексы же одинаковые по структуре. Кидаю запрос в регрессионную базу — то же самое, скан коллекции.

Это означает что индекс не работает, а точнее что запрос не попадает в него и база вынуждена искать тупым монотонным перебором.

Так почему же в регрессии запрос при этом отрабатывал а в стейдже падал с таймаутом?
Ларчик просто открывался- размер этой коллекции в регрессии — 133 мегабайта, в стейдже — 175 гигов. В проде естественно еще хуже)))

Пошел натыкал в это разработку. Другая команда благо пришла на помощь и быстро состряпала запрос который давал бы тот же результат но попадал в созданный ранее индекс. Теперь авторы сидят исправляют — ждем следующий релиз с хот фиксом. Вот так вот…

Источник

Anyone seeing the above error on the oplog cursor? My configuration is I have a remote machine which is my oplog source, and my oplog application is running locally pointing to a local mongoDB instance. After some time and many insert, update and delete events, I will eventually get the following. I am using mongoDB 3.4.x locally and my remote machine is 3.2.8.

The following appears to occur sometime when questions of connection and query length reach the internal MongoDB limit cursor.maxTimeMS. Can anyone tell me or suggest a sufficient reason that the code would get this? stream.js appears to be correctly coded, is there an issue elsewhere?

    tailable: true,
    awaitData: true,
    oplogReplay: true,
    noCursorTimeout: true,

Error is:

  startOplog 0] ------- Error ------- +10s
  startOplog-details 0] Error object: { MongoError: operation exceeded time limit
    at Function.MongoError.create (/home/e/src/mongo-oplog-2.1.0/node_modules/mongodb-core/lib/error.js:31:11)
    at /home/e/src/mongo-oplog-2.1.0/node_modules/mongodb-core/lib/connection/pool.js:497:72
    at authenticateStragglers (/home/e/src/mongo-oplog-2.1.0/node_modules/mongodb-core/lib/connection/pool.js:443:16)
    at Connection.messageHandler (/home/e/src/mongo-oplog-2.1.0/node_modules/mongodb-core/lib/connection/pool.js:477:5)
    at TLSSocket.<anonymous> (/home/e/src/mongo-oplog-2.1.0/node_modules/mongodb-core/lib/connection/connection.js:331:22)
    at emitOne (events.js:115:13)
    at TLSSocket.emit (events.js:210:7)
    at addChunk (_stream_readable.js:252:12)
    at readableAddChunk (_stream_readable.js:239:11)
    at TLSSocket.Readable.push (_stream_readable.js:197:10)
  name: 'MongoError',
  message: 'operation exceeded time limit',
  ok: 0,
  errmsg: 'operation exceeded time limit',
  code: 50 } +10s
  startOplog 0] ------- Error ------- +1ms
  mongo-oplog stream ended +1ms
  startOplog 0] ------- End ------- +0ms
  startOplog-details 0] End object: undefined +0ms
  startOplog 0] ------- End ------- +0ms
  mongo-oplog oplog error {"name":"MongoError","message":"operation exceeded time limit","ok":0,"errmsg":"operation exceeded time limit","code":50} +1ms
  startOplog 2] ------- Error ------- +1ms

I have made the following changes to stream.js which have made the symptoms better, but not perfect.

  async function time() {
    if (ts) return (typeof ts !== 'number') ? ts : Timestamp(0, ts)

    const doc = await coll
      .find({ }, { ts: 1 })
      .sort({ $natural: -1 })
      .limit(1)
      .nextObject()

    //return doc ? doc.ts : Timestamp(0, (Date.now() / 1000 | 0))
    return doc ? doc.ts : Timestamp(0, 0)
  }

  if (ns) query.ns = { $regex: regex(ns) }
  query.ts = { $gt: await time() }

  return (await coll.find(query, {
    "tailable": true,
    "awaitData": true,
    "oplogReplay": true,
    "noCursorTimeout": true,
    "maxTimeMS": 200,
    "numberOfRetries": Number.MAX_VALUE
  })).stream()

I am also trying the following, but do not have enough data points on how that will do yet.

  return (await coll.find(query, {
    "tailable": true,
    "awaitData": true,
    "oplogReplay": true,
    "maxTimeMS": 100,
    "numberOfRetries": Number.MAX_VALUE
  }).addCursorFlag('noCursorTimeout',true)).stream()

So I chased this far enough to know that in the mongodb npm driver, collection.js checks for and translates timeout, ie.

  // Translate to new command option noCursorTimeout
  if(typeof newOptions.timeout == 'boolean') newOptions.noCursorTimeout = newOptions.timeout;

Therefore I made the following change to see if that would help at all.

    const doc = await coll
      .find({ }, { ts: 1, timeout: false })
      .sort({ $natural: -1 })
      .limit(1)
      .nextObject()

    //return doc ? doc.ts : Timestamp(0, (Date.now() / 1000 | 0))
    return doc ? doc.ts : Timestamp(0, 0)
  }

  if (ns) query.ns = { $regex: regex(ns) }
  query.ts = { $gt: await time() }

  return (await coll.find(query, {
    "tailable": true,
    "awaitData": true,
    "oplogReplay": true,
    "maxTimeMS": 100,
    "timeout": false,
    "numberOfRetries": Number.MAX_VALUE
  })).stream()

After some time it still throws the error, not on just one of my oplog streams, but all.

  startOplog 1] ------- End ------- +0ms
  startOplog-details +0ms] NO End object data
  startOplog 1] ------- End ------- +0ms
  mongo-oplog oplog error {"name":"MongoError","message":"operation exceeded time limit","ok":0,"errmsg":"operation exceeded time limit","code":50} +0ms
  startOplog 0] ------- Error ------- +0ms
  startOplog-details 0] Error object: { MongoError: operation exceeded time limit
    at Function.MongoError.create (/home/egriffin/mesos-infrastructure/src/mongo-oplog-2.1.0/node_modules/mongodb-core/lib/error.js:31:11)
    at /home/egriffin/mesos-infrastructure/src/mongo-oplog-2.1.0/node_modules/mongodb-core/lib/connection/pool.js:497:72
    at authenticateStragglers (/home/egriffin/mesos-infrastructure/src/mongo-oplog-2.1.0/node_modules/mongodb-core/lib/connection/pool.js:443:16)
    at Connection.messageHandler (/home/egriffin/mesos-infrastructure/src/mongo-oplog-2.1.0/node_modules/mongodb-core/lib/connection/pool.js:477:5)
    at TLSSocket.<anonymous> (/home/egriffin/mesos-infrastructure/src/mongo-oplog-2.1.0/node_modules/mongodb-core/lib/connection/connection.js:333:22)
    at emitOne (events.js:115:13)
    at TLSSocket.emit (events.js:210:7)
    at addChunk (_stream_readable.js:252:12)
    at readableAddChunk (_stream_readable.js:239:11)
    at TLSSocket.Readable.push (_stream_readable.js:197:10)
  name: 'MongoError',
  message: 'operation exceeded time limit',
  ok: 0,
  errmsg: 'operation exceeded time limit',
  code: 50 } +0ms

Next steps I hacked mongodb npm module, lib/collection.js

  // Translate to new command option noCursorTimeout
  if(typeof newOptions.timeout == 'boolean') newOptions.noCursorTimeout = newOptions.timeout;
  newOptions.noCursorTimeout = false;

  // Merge in options to command
  for(var name in newOptions) {
    if(newOptions[name] != null) {
 console.log("coll.js opt[%s]: %s", name, newOptions[name]);
      findCommand[name] = newOptions[name];
    }
  }

I see its getting set, not sure about its use yet:

coll.js opt[readPreference]: [object Object]
coll.js opt[tailable]: true
coll.js opt[awaitData]: true
coll.js opt[oplogReplay]: true
coll.js opt[maxTimeMS]: 100
coll.js opt[timeout]: false
coll.js opt[numberOfRetries]: 1.7976931348623157e+308
coll.js opt[skip]: 0
coll.js opt[limit]: 0
coll.js opt[slaveOk]: true
coll.js opt[noCursorTimeout]: false

I think I found the real issue, the index.js code is not restarting the tail on timeout. So I am testing the following.

  function onerror(err) {
    if (/cursor (killed or )?timed out/.test(err.message)) {
      debug('cursor timeout - re-tailing %j', err)
      tail()
    } else if (/exceeded time limit/.test(err.message)) {
      debug('cursor exceeded timeout - re-tailing %j', err)
      tail()
    } else {
      debug('oplog error %j', err)
      oplog.emit('error', err)
    }
  }

The above appears to work, I will create a PR to contrib this back. Since I am using TLS connections across the wire this may have something to do with you missing this particular cursor error, as well as perhaps the versions of node and npm I am using.

Hey, did you ever create a PR? Was this issue solved? Seeing some problems with oplog tailing here as well, and just want to make sure which path to chase down on.

@heipei No, I never created a PR for the above, but the above does appear to work for us. Its strange since I am not sure what triggers the other message and have not been able to track down the cause, other than possible internal private network contstraints (we do not use external IPs with oplog).

Источник

It may happen that if we run a sort operation on a collection, we encounter this error:

Error: error: {
“ok” : 0,
“errmsg” : “Executor error during find command :: caused by :: errmsg: ”Sort operation used more than the maximum 33554432 bytes of RAM. Add an index, or specify a smaller limit.””,
“code” : 96,
“codeName” : “OperationFailed”
}

When doing a sort MongoDb first attempts to retrieve the documents using the order specified in an index. When no index is available it will try to load the documents into memory and sort them there. The catch is that Mongo is configured by default to abort the operation when exceeding an usage of 32 MB. In that case we run into the error described above.

There are various ways to solve this. The most obvious one would be to add the missing index. Another way would be to modify the configuration settings to increase the RAM limit:

db.adminCommand({setParameter: 1, internalQueryExecMaxBlockingSortBytes: 335544320})

With the above command I have setted the RAM limit to 320 MB. If I wanted to check the RAM limit I would run this command:

db.runCommand( { getParameter : 1, "internalQueryExecMaxBlockingSortBytes" : 1 } )

Given that the above query returns the result in bytes you may want to use a conversion tool like this one.

Both commands must be executed in admin DB, otherwise the shell will return an unauthorised error.

Now if we do modify the RAM limit and try to run again the query we may notice – as mentioned before – that the performance can take quite a hit. We may have to wait a bit till the sorting is complete.

Another dirty alternative would be to use the aggregate command to the the sorting with the allowDiskUse set to true as stated in the MongoDB documentation:

Aggregation pipeline stages have maximum memory use limit. To handle large datasets, set allowDiskUse option to true to enable writing data to temporary files

What MongoDb is saying here is that aggregate like sort has a maximum of RAM configured, but when exceeded instead of abort the operation it will continue using disk file storage instead of RAM, if the allowDiskUse has been set to true.

db.getCollection('movies').aggregate( [
      { $sort : { year : 1} }
   ],
   { allowDiskUse: true }
)

As with modifying the RAM configuration this approach will do the trick, but performance wise is in the low side. After we have moved from fast memory processing, to slow disk storage.

If we want to get a good preformance when sorting collection we have to create an index:

db.movies.createIndex( { year: 1 } )

I created an index on the movies collection by year. Now running this query will not only avoid the ram exceeded error, but will be also sort significantly quicker:

db.getCollection('movies').find({}).sort({year:1})

Источник

Читайте также: