Overview
This repository contains a web scraper implemented across three languages: Rust, Perl, and Python. Each component performs a specific task in the scraping pipeline, and the components communicate via encrypted inter-process communication (IPC).
- Rust: Manages the HTTP requests and initial data storage.
- Perl: Parses the HTML content and saves reply logs.
- Python: Handles additional data processing and saves to Parquet files.
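To make the hand-off between stages concrete, here is a minimal sketch of a message envelope one stage might pass (encrypted) to the next. The field names and `make_envelope` helper are illustrative assumptions, not the actual wire format used by the Rust side:

```python
import json
import time

def make_envelope(request_id, stage, payload):
    """Wrap a stage's output so the next stage knows its origin and age.

    Hypothetical schema: the real message layout is defined by the
    Rust component, not shown here.
    """
    return json.dumps({
        "request_id": request_id,
        "stage": stage,            # "rust", "perl", or "python"
        "timestamp": int(time.time()),
        "payload": payload,
    })

msg = make_envelope("req1", "rust",
                    {"url": "https://example.com", "html": "<p>hi</p>"})
decoded = json.loads(msg)
```

In this sketch the JSON string would be encrypted before being written to the IPC channel and decrypted by the receiving stage.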
File Structure
.
├── rust_part
│   ├── Cargo.toml
│   └── main.rs
├── perl_part
│   └── main.pl
└── python_part
    └── main.py
import uuid

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def save_to_parquet(data, request_id, timestamp):
    random_digits = uuid.uuid4().int & ((1 << 10) - 1)  # random 10-bit integer (0-1023)
    file_name = f"data/{request_id}_{random_digits}_{timestamp}.parquet"
    # Convert the data to a DataFrame, then to a Parquet table
    df = pd.DataFrame([data])
    table = pa.Table.from_pandas(df)
    # Save the table to a Parquet file
    pq.write_table(table, file_name)
We generate a fresh encryption key for every session, so a compromised key exposes at most that one session's IPC traffic.
from cryptography.fernet import Fernet

# Generate a key for encryption and decryption
key = Fernet.generate_key()
cipher_suite = Fernet(key)

def encrypt_data(data):
    encrypted_data = cipher_suite.encrypt(data.encode())
    return encrypted_data

def decrypt_data(encrypted_data):
    decrypted_data = cipher_suite.decrypt(encrypted_data).decode()
    return decrypted_data
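A minimal round-trip check of the helpers above (assuming the `cryptography` package is installed):

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()
cipher_suite = Fernet(key)

def encrypt_data(data):
    return cipher_suite.encrypt(data.encode())

def decrypt_data(encrypted_data):
    return cipher_suite.decrypt(encrypted_data).decode()

# Encrypt a message, then decrypt it back to the original plaintext.
token = encrypt_data("hello")
plaintext = decrypt_data(token)
```

Because Fernet tokens embed a random IV, encrypting the same string twice yields different ciphertexts, so the encrypted bytes never reveal repeated messages.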