Character Error Rate in Python


Contents

  1. Char Error Rate
  2. Module Interface
  3. Functional Interface
  4. jiwer 2.5.1
  5. Metadata
  6. Classifiers
  7. Project description
  8. JiWER: Similarity measures for automatic speech recognition evaluation
  9. Installation
  10. Usage
  11. pre-processing
  12. transforms
  13. Compose
  14. ReduceToListOfListOfWords
  15. ReduceToSingleSentence
  16. RemoveSpecificWords
  17. RemoveWhiteSpace
  18. RemovePunctuation
  19. RemoveMultipleSpaces
  20. Strip
  21. RemoveEmptyStrings
  22. ExpandCommonEnglishContractions
  23. Evaluate OCR Output Quality with Character Error Rate (CER) and Word Error Rate (WER)
  24. Key concepts, examples, and Python implementation of measuring Optical Character Recognition output quality
  25. Importance of Evaluation Metrics
  26. Error Rates and Levenshtein Distance
  27. Character Error Rate (CER)
  28. (i) Equation
  29. (ii) Illustration with Example
  30. (iii) CER Normalization
  31. (iv) What is a good CER value?
  32. Word Error Rate (WER)
  33. Python Example (with TesseractOCR and fastwer)
  34. Summing it up

Char Error Rate

Module Interface

Character Error Rate (CER) is a metric of the performance of an automatic speech recognition (ASR) system.

This value indicates the percentage of characters that were incorrectly predicted. The lower the value, the better the performance of the ASR system, with a CharErrorRate of 0 being a perfect score. Character error rate can then be computed as:

CER = (S + D + I) / N = (S + D + I) / (S + D + C)

where:

  • S is the number of substitutions,
  • D is the number of deletions,
  • I is the number of insertions,
  • C is the number of correct characters,
  • N is the number of characters in the reference (N = S + D + C).
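A minimal pure-Python sketch of this computation (an illustration, not the torchmetrics implementation; the edit counts come from a standard Levenshtein dynamic program):

```python
def levenshtein(ref, hyp):
    """Minimum number of single-character edits turning ref into hyp."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # dp[j]: distance from ref[:0] to hyp[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def char_error_rate(pred, target):
    """CER = (S + D + I) / N, with N = number of reference characters."""
    return levenshtein(target, pred) / len(target)
```

For example, char_error_rate("abcx", "abcd") gives 0.25: one substitution over four reference characters.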

Compute CharErrorRate score of transcribed segments against references.

kwargs (Any) – Additional keyword arguments; see Advanced metric settings for more info.

Returns: the character error rate score.

__init__() initializes internal Module state, shared by both nn.Module and ScriptModule.

compute() calculates the character error rate and returns the score.

update() stores references/predictions for computing Character Error Rate scores:

preds (Union[str, List[str]]) – Transcription(s) to score as a string or list of strings

target (Union[str, List[str]]) – Reference(s) for each speech input as a string or list of strings

Functional Interface

Character error rate is a common metric of the performance of an automatic speech recognition system. This value indicates the percentage of characters that were incorrectly predicted. The lower the value, the better the performance of the ASR system, with a CER of 0 being a perfect score.

preds (Union[str, List[str]]) – Transcription(s) to score as a string or list of strings

target (Union[str, List[str]]) – Reference(s) for each speech input as a string or list of strings

Source

jiwer 2.5.1

pip install jiwer

Released: Sep 6, 2022

Evaluate your speech-to-text system with similarity measures such as word error rate (WER)

Metadata

License: Apache Software License (Apache-2.0)

Requires: Python >=3.7. Maintainer: nikvaessen

Classifiers

  • License
    • OSI Approved :: Apache Software License
  • Programming Language
    • Python :: 3
    • Python :: 3.7
    • Python :: 3.8
    • Python :: 3.9
    • Python :: 3.10

Project description

JiWER: Similarity measures for automatic speech recognition evaluation

This repository contains a simple python package to approximate the Word Error Rate (WER), Match Error Rate (MER), Word Information Lost (WIL) and Word Information Preserved (WIP) of a transcript. It computes the minimum-edit distance between the ground-truth sentence and the hypothesis sentence of a speech-to-text API. The minimum-edit distance is calculated using the Python C module Levenshtein.

Installation

You should be able to install this package using poetry:

Or, if you prefer old-fashioned pip and you’re using Python >= 3.7 :
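The install commands themselves were dropped during extraction; for a package published on PyPI as jiwer, they would presumably be:

```shell
# with poetry
poetry add jiwer

# or with pip (Python >= 3.7)
pip install jiwer
```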

Usage

The simplest use case is computing the edit distance between two strings:

Similarly, to get other measures:

You can also compute the WER over multiple sentences:

We also provide the character error rate:
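The code snippets for this section were lost in extraction. As a stand-in, corpus-level WER can be sketched in plain Python (this mirrors the quantity jiwer reports, but is not jiwer's code, and it skips jiwer's default text transformations):

```python
def edit_distance(ref, hyp):
    """Minimum edits (substitution/deletion/insertion) between two sequences."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1,
                        prev + (ref[i - 1] != hyp[j - 1]))
            prev = cur
    return dp[n]

def corpus_wer(references, hypotheses):
    """WER over multiple sentences: total word edits / total reference words."""
    edits = words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        edits += edit_distance(ref_words, hyp_words)
        words += len(ref_words)
    return edits / words
```

For example, corpus_wer(["hello world"], ["hello duck"]) gives 0.5: one substitution over two reference words.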

pre-processing

It might be necessary to apply some pre-processing steps on either the hypothesis or ground truth text. This is possible with the transformation API:

By default, the following transformation is applied to both the ground truth and the hypothesis. Note that this is simply to get the text into the right format for calculating the WER.

transforms

We provide some predefined transforms. See jiwer.transformations .

Compose

jiwer.Compose(transformations: List[Transform]) can be used to combine multiple transformations.

Note that each transformation pipeline needs to end with jiwer.ReduceToListOfListOfWords(), as the library internally computes the word error rate based on a double list of words.
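The idea behind Compose is plain function composition over a list of transforms; a minimal sketch of the pattern (not jiwer's actual implementation, and the lowercase/strip/split pipeline below is just a hypothetical example):

```python
class Compose:
    """Apply a list of single-argument transforms in order."""
    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, text):
        for transform in self.transforms:
            text = transform(text)
        return text

# hypothetical pipeline: lowercase, strip, then split into words
pipeline = Compose([str.lower, str.strip, str.split])
```

Calling pipeline("  Hello World ") runs the three steps in order and yields ["hello", "world"].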

ReduceToListOfListOfWords

jiwer.ReduceToListOfListOfWords(word_delimiter=" ") can be used to transform one or more sentences into a list of lists of words. The sentences can be given as a string (one sentence) or a list of strings (one or more sentences). This operation should be the final step of any transformation pipeline, as the library internally computes the word error rate based on a double list of words.

ReduceToSingleSentence

jiwer.ReduceToSingleSentence(word_delimiter=" ") can be used to transform multiple sentences into a single sentence. The sentences can be given as a string (one sentence) or a list of strings (one or more sentences). This operation can be useful when the number of ground truth sentences and hypothesis sentences differ, and you want to do a minimal alignment over these lists. Note that this introduces an order dependence: wer([a, b], [a, b]) might not be equal to wer([b, a], [b, a]).

RemoveSpecificWords

jiwer.RemoveSpecificWords(words_to_remove: List[str]) can be used to filter out certain words. As removed words are replaced with a space character, make sure that jiwer.RemoveMultipleSpaces, jiwer.Strip() and jiwer.RemoveEmptyStrings are present in the composition after jiwer.RemoveSpecificWords.

RemoveWhiteSpace

jiwer.RemoveWhiteSpace(replace_by_space=False) can be used to filter out white space. The whitespace characters are " ", "\t", "\n", "\r", "\x0b" and "\x0c". Note that by default the space character (" ") is also removed, which will make it impossible to split a sentence into a list of words by using ReduceToListOfListOfWords or ReduceToSingleSentence. This can be prevented by replacing all whitespace with the space character (replace_by_space=True). If so, make sure that jiwer.RemoveMultipleSpaces, jiwer.Strip() and jiwer.RemoveEmptyStrings are present in the composition after jiwer.RemoveWhiteSpace.

RemovePunctuation

jiwer.RemovePunctuation() can be used to filter out punctuation. The punctuation characters are defined as all unicode characters whose category name starts with P. See https://www.unicode.org/reports/tr44/#General_Category_Values.

RemoveMultipleSpaces

jiwer.RemoveMultipleSpaces() can be used to filter out multiple spaces between words.

Strip

jiwer.Strip() can be used to remove all leading and trailing spaces.

RemoveEmptyStrings

jiwer.RemoveEmptyStrings() can be used to remove empty strings.

ExpandCommonEnglishContractions

jiwer.ExpandCommonEnglishContractions() can be used to replace common contractions such as let’s with let us.

Currently, this method will perform the following replacements. Note that ␣ is used to indicate a space character to get around markdown rendering constraints.

Source

Evaluate OCR Output Quality with Character Error Rate (CER) and Word Error Rate (WER)

Key concepts, examples, and Python implementation of measuring Optical Character Recognition output quality

Contents

Importance of Evaluation Metrics

Great job in successfully generating output from your OCR model! You have done the hard work of labeling and pre-processing the images, setting up and running your neural network, and applying post-processing on the output.

The final step now is to assess how well your model has performed. Even if it gave high confidence scores, we need to measure performance with objective metrics. Since you cannot improve what you do not measure, these metrics serve as a vital benchmark for the iterative improvement of your OCR model.

In this article, we will look at two metrics used to evaluate OCR output, namely Character Error Rate (CER) and Word Error Rate (WER).

Error Rates and Levenshtein Distance

The usual way of evaluating prediction output is with the accuracy metric, where we indicate a match (1) or no match (0). However, this does not provide enough granularity to assess OCR performance effectively.

We should instead use error rates to determine the extent to which the OCR transcribed text and ground truth text (i.e., reference text labeled manually) differ from each other.

A common intuition is to see how many characters were misspelled. While this is correct, the actual error rate calculation is more complex than that. This is because the OCR output can have a different length from the ground truth text.

Furthermore, there are three different types of error to consider:

  • Substitution error: Misspelled characters/words
  • Deletion error: Lost or missing characters/words
  • Insertion error: Incorrect inclusion of character/words

The question now is, how do you measure the extent of errors between two text sequences? This is where Levenshtein distance enters the picture.

Levenshtein distance is a distance metric measuring the difference between two string sequences. It is the minimum number of single-character (or word) edits (i.e., insertions, deletions, or substitutions) required to change one word (or sentence) into another.

For example, the Levenshtein distance between “ mitten” and “ fitting” is 3 since a minimum of 3 edits is needed to transform one into the other.

The more different the two text sequences are, the higher the number of edits needed, and thus the larger the Levenshtein distance.
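The mitten/fitting example can be checked mechanically; here is a compact Wagner–Fischer sketch (illustrative, not from the article):

```python
def levenshtein(a, b):
    """Minimum single-character edits (insert/delete/substitute) from a to b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        row = [i]
        for j, cb in enumerate(b, 1):
            row.append(min(prev[j] + 1,                # deletion
                           row[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = row
    return prev[-1]
```

levenshtein("mitten", "fitting") returns 3, matching the example: two substitutions (m→f, e→i) and one insertion (g).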

Character Error Rate (CER)

(i) Equation

CER calculation is based on the concept of Levenshtein distance, where we count the minimum number of character-level operations required to transform the ground truth text (aka reference text) into the OCR output.

It is represented with this formula:

CER = (S + D + I) / N

where:

  • S = Number of Substitutions
  • D = Number of Deletions
  • I = Number of Insertions
  • N = Number of characters in reference text (aka ground truth)

Bonus Tip: The denominator N can alternatively be computed with:
N = S + D + C (where C = number of correct characters)

The output of this equation represents the percentage of characters in the reference text that were incorrectly predicted in the OCR output. The lower the CER value (with 0 being a perfect score), the better the performance of the OCR model.

(ii) Illustration with Example

Let’s look at an example:

Several errors require edits to transform OCR output into the ground truth:

  1. g instead of 9 (at reference text character 3)
  2. Missing 1 (at reference text character 7)
  3. Z instead of 2 (at reference text character 8)

With that, here are the values to input into the equation:

  • Number of Substitutions (S) = 2
  • Number of Deletions (D) = 1
  • Number of Insertions (I) = 0
  • Number of characters in reference text (N) = 9

Based on the above, we get (2 + 1 + 0) / 9 = 0.3333. When converted to a percentage value, the CER becomes 33.33%. This implies that, on average, one in every three characters in the sequence was incorrectly transcribed.

We repeat this calculation for all the pairs of transcribed output and corresponding ground truth, and take the mean of these values to obtain an overall CER percentage.
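Plugging the counts from the worked example into the formula, and averaging over pairs as described (the second tuple below is a made-up pair, purely for illustration):

```python
def cer_from_counts(s, d, i, n):
    """CER = (S + D + I) / N."""
    return (s + d + i) / n

# counts from the worked example above
example = cer_from_counts(s=2, d=1, i=0, n=9)  # 0.3333... -> 33.33%

# overall CER: mean over several (S, D, I, N) tuples (second tuple is hypothetical)
pairs = [(2, 1, 0, 9), (0, 0, 1, 10)]
overall = sum(cer_from_counts(*p) for p in pairs) / len(pairs)
```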

(iii) CER Normalization

One thing to note is that CER values can exceed 100%, especially with many insertions. For example, the CER for ground truth ‘ ABC’ and a longer OCR output ‘ ABC12345’ is 166.67%.

It felt a little strange to me that an error value can go beyond 100%, so I looked around and managed to come across an article by Rafael C. Carrasco that discussed how normalization could be applied:

Sometimes the number of mistakes is divided by the sum of the number of edit operations ( i + s + d ) and the number c of correct symbols, which is always larger than the numerator.

The normalization technique described above keeps CER values within the range of 0–100% at all times. It can be represented with this formula:

Normalized CER = (S + D + I) / (S + D + I + C)

where C = Number of correct characters
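For the ‘ABC’ vs ‘ABC12345’ example above, both variants can be computed directly (a small illustrative sketch):

```python
def cer(s, d, i, n):
    """Raw CER: can exceed 1.0 when there are many insertions."""
    return (s + d + i) / n

def cer_normalized(s, d, i, c):
    """Divide by edits plus correct characters, bounding the rate at 100%."""
    return (s + d + i) / (s + d + i + c)

# ground truth 'ABC' vs OCR output 'ABC12345': 5 insertions, 3 correct chars
raw = cer(s=0, d=0, i=5, n=3)              # 1.6667 -> 166.67%
norm = cer_normalized(s=0, d=0, i=5, c=3)  # 0.625  -> 62.5%
```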

(iv) What is a good CER value?

There is no single benchmark for defining a good CER value, as it is highly dependent on the use case. Different scenarios and complexity (e.g., printed vs. handwritten text, type of content, etc.) can result in varying OCR performances. Nonetheless, there are several sources that we can take reference from.

An article published in 2009 on the review of OCR accuracy in large-scale Australian newspaper digitization programs came up with these benchmarks (for printed text):

  • Good OCR accuracy: CER 1–2% (i.e., 98–99% accurate)
  • Average OCR accuracy: CER 2–10%
  • Poor OCR accuracy: CER >10% (i.e., below 90% accurate)

For complex cases involving handwritten text with highly heterogeneous and out-of-vocabulary content (e.g., application forms), a CER value as high as around 20% can be considered satisfactory.

Word Error Rate (WER)

If your project involves transcription of particular sequences (e.g., social security number, phone number, etc.), then the use of CER will be relevant.

On the other hand, Word Error Rate might be more applicable if it involves the transcription of paragraphs and sentences of words with meaning (e.g., pages of books, newspapers).

The formula for WER is the same as that of CER, but WER operates at the word level instead: WER = (S + D + I) / N, where S, D, and I count the word substitutions, deletions, and insertions needed to transform one sentence into another, and N is the number of words in the reference.

WER is generally well-correlated with CER (provided error rates are not excessively high), although the absolute WER value is expected to be higher than the CER value.

  • Ground Truth: ‘my name is kenneth’
  • OCR Output: ‘myy nime iz kenneth’

From the above, the CER is 16.67%, whereas the WER is 75%. The WER value of 75% is clearly understood since 3 out of 4 words in the sentence were wrongly transcribed.
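Both rates for this example can be computed with a single sequence-level edit distance, applied once to characters and once to word lists (a sketch, not the article's notebook code):

```python
def edit_distance(ref, hyp):
    """Minimum insert/delete/substitute edits between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        row = [i]
        for j, h in enumerate(hyp, 1):
            row.append(min(prev[j] + 1, row[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = row
    return prev[-1]

truth = "my name is kenneth"
ocr = "myy nime iz kenneth"

cer = edit_distance(truth, ocr) / len(truth)                          # char level
wer = edit_distance(truth.split(), ocr.split()) / len(truth.split())  # word level
```

This reproduces the article's figures: 3 character edits over 18 reference characters gives a CER of 16.67%, and 3 wrong words out of 4 gives a WER of 75%.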

Python Example (with TesseractOCR and fastwer)

We have covered enough theory, so let’s look at an actual Python code implementation.

In the demo notebook, I ran the open-source TesseractOCR model to extract output from several sample images of handwritten text. I then utilized the fastwer package to calculate CER and WER from the transcribed output and ground truth text (which I labeled manually).

Summing it up

In this article, we covered the concepts and examples of CER and WER and details on how to apply them in practice.

While CER and WER are handy, they are not bulletproof performance indicators of OCR models. This is because the quality and condition of the original documents (e.g., handwriting legibility, image DPI, etc.) play an equally (if not more) important role than the OCR model itself.

I welcome you to join me on a data science learning journey! Give this Medium page a follow to stay in the loop of more data science content, or reach out to me on LinkedIn. Have fun evaluating your OCR model!

Source

Project description

A PyPI package for fast word/character error rate (WER/CER) calculation

  • fast (cpp implementation)
  • sentence-level and corpus-level WER/CER scores

Installation

pip install fastwer

Example

import fastwer
hypo = ['This is an example .', 'This is another example .']
ref = ['This is the example :)', 'That is the example .']

# Corpus-Level WER: 40.0
fastwer.score(hypo, ref)
# Corpus-Level CER: 25.5814
fastwer.score(hypo, ref, char_level=True)

# Sentence-Level WER: 40.0
fastwer.score_sent(hypo[0], ref[0])
# Sentence-Level CER: 22.7273
fastwer.score_sent(hypo[0], ref[0], char_level=True)

Contact

Changhan Wang (wangchanghan@gmail.com)



jitsi/jiwer



jiwer.ExpandCommonEnglishContractions() can be used to replace common contractions such as let’s with let us.

Currently, this method will perform the following replacements. Note that ␣ is used to indicate a space character to get around markdown rendering constraints.

Contraction    Transformed into
won’t          ␣will not
can’t          ␣can not
let’s          ␣let us
n’t            ␣not
’re            ␣are
’s             ␣is
’d             ␣would
’ll            ␣will
’t             ␣not
’ve            ␣have
’m             ␣am
jiwer.SubstituteWords(dictionary: Mapping[str, str]) can be used to replace one word with another. Note that the whole word is matched. If the word you’re attempting to substitute is a substring of another word it will not be affected. For example, if you’re substituting foo with bar, the word foobar will NOT be substituted into barbar.

jiwer.SubstituteRegexes(dictionary: Mapping[str, str]) can be used to replace a substring matching a regex expression with another substring.

jiwer.ToLowerCase() can be used to convert every character to lowercase.

jiwer.ToUpperCase() can be used to convert every character to uppercase.

jiwer.RemoveKaldiNonWords() can be used to remove any word between [] and <> . This can be useful when working with hypotheses from the Kaldi project, which can output non-words such as [laugh] and .


Source



Your question is framed as though the most fruitful avenue for code improvement
lies in the area of modifying the style of iteration: maybe list comprehension,
maybe for-loop, maybe double list comprehension. To my eye, however, the biggest
problem with the code is its proliferation of similar variables. Whenever your
code has so many variables like this, it’s a sign that you should step back and
get the data better organized — either in the form of collections or data
objects.

As best I can tell you have 16 variables, which come in groups of four: number
of sentences, number of ok sentences, number of characters, and number of error
characters. You have four of those groups: regular, long, short, and not1. Each
of those groups behaves like a tally: as you loop through the data, you need to
compute a few values (size of y_true, correctness, and edit distance) and
then use those values to update one or more tallies. (Because you have not
given us much context, some of my terminology choices might be poor, but
hopefully you can understand the general point and modify the vocabulary
accordingly.)

One simple idea is to create a basic data object that can perform the
calculations needed for the updating process. This object is not super
important for the plan I’m suggesting, but it does help to move detail out of
the primary data-processing loop. That’s usually a good move: shift algorithmic
detail down to simple functions or classes, and leave the data-processing loop
focused on reading and looping.

import editdistance  # third-party package: pip install editdistance

class Update:

    def __init__(self, y_true, y_pred):
        self.size = len(y_true)
        self.is_correct = y_true.replace(' ', '') == y_pred.replace(' ', '')
        self.dist = editdistance.eval(y_true, y_pred)

More important is some type of data object to represent your ongoing tallies.
A Tally would have a type or kind, and it would know how to update its counts.

class Tally:

    REGULAR = 'regular'
    LONG = 'long'
    SHORT = 'short'
    NOT1 = 'not1'

    def __init__(self, kind):
        self.kind = kind
        self.nsent = 0
        self.nsent_ok = 0
        self.nchar = 0
        self.nchar_err = 0

    def update(self, u):
        self.nsent += 1
        self.nsent_ok += u.is_correct
        self.nchar += u.size
        self.nchar_err += u.dist

The updating portion of the data-processing loop would look like this:

reg = Tally(Tally.REGULAR)
long = Tally(Tally.LONG)
short = Tally(Tally.SHORT)
not1 = Tally(Tally.NOT1)

while loader.has_next():
    ...
    for y_true, y_pred in zip(batch.gt_texts, recognized):
        u = Update(y_true, y_pred)
        reg.update(u)
        if u.size != 1:
            not1.update(u)
        (short if u.size < long_threshold else long).update(u)

Notice that the deployment of simple data objects like Update
and Tally do more than improve code readability. They also
create handy vehicles for further additions to functionality.
For example, if you have Tally.nsent and Tally.nsent_ok
you can easily add a property to give you their difference
on demand. You can add this new behavior in a single place,
without having to modify multiple places in the
program where you might want to create, manage, and use
this additional attribute.

class Tally:

    @property
    def nsent_nok(self):
        return self.nsent - self.nsent_ok
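The pattern above runs end to end; here is a condensed, self-contained version (with a small pure-Python edit distance standing in for editdistance.eval, and a two-pair toy dataset in place of the data loader):

```python
def _edit_distance(a, b):
    """Stand-in for editdistance.eval: minimum character edits between strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        row = [i]
        for j, cb in enumerate(b, 1):
            row.append(min(prev[j] + 1, row[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = row
    return prev[-1]

class Update:
    def __init__(self, y_true, y_pred):
        self.size = len(y_true)
        self.is_correct = y_true.replace(' ', '') == y_pred.replace(' ', '')
        self.dist = _edit_distance(y_true, y_pred)

class Tally:
    def __init__(self, kind):
        self.kind = kind
        self.nsent = self.nsent_ok = self.nchar = self.nchar_err = 0

    def update(self, u):
        self.nsent += 1
        self.nsent_ok += u.is_correct
        self.nchar += u.size
        self.nchar_err += u.dist

    @property
    def nsent_nok(self):
        # derived attribute, computed on demand
        return self.nsent - self.nsent_ok

reg = Tally('regular')
for y_true, y_pred in [('hello', 'hello'), ('world', 'wrld')]:
    reg.update(Update(y_true, y_pred))
```

After the loop, reg.nsent is 2, reg.nsent_ok is 1, reg.nchar is 10, reg.nchar_err is 1 (the dropped 'o'), and the derived reg.nsent_nok is 1.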
