web-scrapper

September 22, 2023

Overview

This repository contains a web scraper implemented across three languages: Rust, Perl, and Python. Each component performs a specific task in the scraping pipeline and communicates with the others via encrypted IPC (inter-process communication); one receiving end of that IPC is sketched after the component list below.

  • Rust: Manages HTTP requests and initial data storage.
  • Perl: Parses the HTML content and saves reply logs.
  • Python: Handles additional data processing and saves to Parquet files.
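
As a rough illustration of that IPC, here is a minimal sketch of how the Python stage could receive one encrypted record from the upstream process. It assumes a Unix-domain socket, a 4-byte length prefix, and Fernet-encrypted JSON payloads; the socket path, framing, and message shape are illustrative assumptions, not the repo's actual protocol.

import json
import socket
import struct

from cryptography.fernet import Fernet

SOCKET_PATH = "/tmp/scraper.sock"  # hypothetical path, not from the repo

def recv_exact(conn, n):
    # Keep reading until exactly n bytes have arrived
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-message")
        buf += chunk
    return buf

def receive_record(conn, cipher):
    # Assumed framing: a 4-byte big-endian length, then one Fernet token
    (length,) = struct.unpack(">I", recv_exact(conn, 4))
    # Decrypt the token and parse the JSON payload produced upstream
    return json.loads(cipher.decrypt(recv_exact(conn, length)))

def serve_one(session_key):
    # Accept a single connection and read one record (demo only)
    cipher = Fernet(session_key)
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
        server.bind(SOCKET_PATH)
        server.listen(1)
        conn, _ = server.accept()
        with conn:
            return receive_record(conn, cipher)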

File Structure

.
├── rust_part
│   ├── Cargo.toml
│   └── main.rs
├── perl_part
│   └── main.pl
└── python_part
    └── main.py

In python_part/main.py, each processed record is written to its own Parquet file:

import os
import uuid

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def save_to_parquet(data, request_id, timestamp):
    # Random 10-bit integer (0-1023) to keep file names from colliding
    random_digits = uuid.uuid4().int & ((1 << 10) - 1)
    file_name = f"data/{request_id}_{random_digits}_{timestamp}.parquet"

    # Make sure the output directory exists
    os.makedirs("data", exist_ok=True)

    # Convert the record to a one-row DataFrame, then to an Arrow table
    df = pd.DataFrame([data])
    table = pa.Table.from_pandas(df)

    # Save the table to a Parquet file
    pq.write_table(table, file_name)
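
A quick usage example; the record fields and request id here are made up for illustration:

import glob
import time

record = {"url": "https://example.com", "status": 200}  # illustrative fields
save_to_parquet(record, request_id="req-0001", timestamp=int(time.time()))

# Read a matching file back to confirm the round trip
print(pd.read_parquet(glob.glob("data/req-0001_*.parquet")[0]))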

We generate a fresh encryption key for every session, so a key captured from one session cannot decrypt another session's traffic.

from cryptography.fernet import Fernet

# Generate a fresh key for this session; both helpers share the same cipher
key = Fernet.generate_key()
cipher_suite = Fernet(key)


def encrypt_data(data):
    # Fernet operates on bytes, so encode the string before encrypting
    encrypted_data = cipher_suite.encrypt(data.encode())
    return encrypted_data

def decrypt_data(encrypted_data):
    # Decrypt back to bytes, then decode to a string
    decrypted_data = cipher_suite.decrypt(encrypted_data).decode()
    return decrypted_data
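
A round trip through the two helpers looks like this:

token = encrypt_data("scraped page body")
assert decrypt_data(token) == "scraped page body"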