An analysis of 35,000 leaked passwords from UC Berkeley students

Jul 20, 2020 Shomil Jain

In Fall 2018, TechCrunch reported that Chegg confirmed a data breach affecting nearly 40 million customers, where hackers gained access to an internal database containing user emails and passwords. Earlier this year, a link was posted to numerous Berkeley-related websites (Reddit, Facebook Pages) containing a file of 35,000 UC Berkeley emails and passwords, reportedly sourced from the Chegg breach. In an effort to analyze the habits of UC Berkeley students through the lens of password security, I downloaded, anonymized, and analyzed the list of passwords.

View this project on GitHub.

How This Happened

When storing passwords, apps and websites typically use a cryptographic hash function (ex: SHA-512) to store a hashed version of the password, rather than the password itself. As a result, if any developer or third party were able to access a database of these passwords, they wouldn’t be able to decipher the hash due to the one-way property of cryptographic hashes.

However, this basic implementation is vulnerable to a dictionary attack, where an attacker pre-computes a list of common passwords – and then uses that list to identify matches from a list of hashed passwords.

For example, the password gobears hashes to c6685d07f089a532755ccfe3b35f6d42 using the MD5 hash. From this point on, if an attacker came upon a hash of c6685d07f089a532755ccfe3b35f6d42, they’d know that the original, plaintext password was gobears. Given a large table of these pre-computed password-hash pairs for common passwords, it would be trivial to perform constant-time lookups from any password database storing passwords using this basic implementation.

To avoid this, companies usually generate a large random string (typically referred to as a salt) and append it to the plaintext password before generating the hash. Assuming the salt is randomly generated, this implementation prevents dictionary attacks from occurring, as the salt adds a component of randomness to each password hash, effectively making each hash unique.

Basic: hash(password)

Secure: (salt, hash(password + salt))

In the case of Chegg, given the scale of the password leak, it’s a reasonable assumption that the company chose the basic implementation over the secure implementation – and an attacker was able to use a dictionary attack on a leaked database of accounts to recover plaintext passwords.

Statistics

Statistic	Value
Total # Passwords Leaked	34820
Average Password Length	8.944
Average Password Strength (Source)	0.301
% Passwords with One or More Number	78%
% Passwords with One or More Special Character	13%
Number of Passwords containing ‘password’	127
Number of Passwords containing ‘bears’	186
Number of Passwords containing ‘berkeley’	181

Most Common Passwords

Password	Count
password	60
123456	38
courserank	36
berkeley	32
gobears	24
evite	23
zinch	17
soccer	15

Histograms

Password Length

Password Strength

Identifying Common Structures

A common password structure involves appending the name of the website to a password base that’s shared across a number of websites. For example, a user with a base password of albert17!!pw may use albert17!!pwCG for Chegg and albert17!!pwGM for Gmail. Identifying passwords that follow these structures could be useful for identifying access points across a number of different websites users may have accounts for.

Statistic (not case sensitive)	Value
Passwords containing “chegg”	121
Passwords containing “cg”	28