Almost exactly a year ago, I wrote about my first foray into implementing DMARC controls, specifically for domains that were never intended to send email. It was the beginning of my DMARC adventure, which eventually expanded to some 35-ish domains.
Over time, this became its own series of posts:
Tightened Controls
In the last few weeks, I tightened the DMARC policy down to the most restrictive level (reject) on most of my monitored and active domains. This was semi-planned, as I noted in the Nine Months In post. What this activity also prompted was finally processing the reports I've collected over the last year (approximately 1,100 of them).
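As a point of reference, the change itself is just one tag in each domain's DMARC TXT record; here's a sketch using example.com and a placeholder reporting mailbox rather than my real domains:
_dmarc.example.com.  3600  IN  TXT  "v=DMARC1; p=reject; sp=reject; rua=mailto:dmarc-reports@example.com"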
1,100 Files!? Where to Start?
There's obviously no reason to try manually analyzing that many files. It's definitely script time!
The 1,100 files are split among seven distinct 'services' or segments. The top two account for roughly 450 and 350 files, respectively. The rest trail off pretty quickly (#3 accounts for just over 150), and the lowest reporter (primarily a full alias domain that shouldn't be sending any email at all) has only 8 report files.
So I chose to start by building a script to look at the lowest one, with only 8 files to analyze. This would make any hiccups less...gnarly, and also make it a lot simpler to identify any weirdness in the data as I built out the parsing mechanism. That became really important once I started to walk the array (below).
The Script Itself
I want the script to do three things automatically:
- Recursively load all the XML report files into a big array I can walk/parse;
- Spit out some simple statistics on, and pointers to, the reports containing DKIM/SPF fail/fail combinations; and
- Spit out a CSV file of the failures I can use to further analyze with PowerBI or the like.
As I usually do, I wrote the script in PHP. There are other ways to do the same thing, but PHP works fabulously for me since I have WSL and VSCode set up on my machine. I can do it all right in the same window that way, and it's "easy" for me.
Loading the XML Files
It's been a while since I worked with XML data in PHP, so my first hurdle was loading several hundred files recursively into an array. I ended up writing a function, based on some info I found in a Google search, to do just that:
// Recursively load every XML report file under $dir into the global
// $reportArray (keyed by file path) as a nested PHP array.
function recursiveGlobXMLToArray($dir, $ext = 'xml') {
    global $reportArray;

    $globFiles = glob("$dir/*.$ext");
    $globDirs  = glob("$dir/*", GLOB_ONLYDIR);

    // Descend into each subdirectory first.
    foreach ($globDirs as $subDir) {
        recursiveGlobXMLToArray($subDir, $ext);
    }

    // Parse each report and convert it to a plain PHP array.
    foreach ($globFiles as $file) {
        $reportArray[$file] = json_decode(json_encode(simplexml_load_file($file)), true);
    }
}
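Calling it is straightforward; a minimal sketch, assuming the reports live in a reports/ directory next to the script (the path is just a placeholder):
$reportArray = [];
recursiveGlobXMLToArray(__DIR__ . '/reports');

echo count($reportArray) . " report files loaded\n";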
The line containing the magic, though, is the last one in the second foreach:
$reportArray[$file] = json_decode(json_encode(simplexml_load_file($file)), true);
This magic might seem excessive, but the business of converting the parsed XML to JSON and then decoding that JSON straight into a PHP array is pure genius! At a glance it doesn't seem like it should behave, but it works great!
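To see why it works, here's a tiny sketch using a made-up two-element XML string (not a real DMARC report): simplexml_load_string() produces a SimpleXMLElement, json_encode() flattens it to JSON, and json_decode(..., true) turns that JSON into an ordinary nested associative array.
$xml = simplexml_load_string('<report><row><count>2</count></row></report>');
$asArray = json_decode(json_encode($xml), true);

// $asArray is now ['row' => ['count' => '2']] -- a plain nested array
// that can be walked with ordinary foreach loops and isset() checks.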
Walk The Array
Running through the resulting super array is pretty standard-issue stuff. The only oddity I discovered is that, presumably due to how the JSON is encoded from the original XML, when a report has more than one record the resulting super array has a slightly different structure than I'd expect if it had been created more traditionally. Wrapping my head around the details and structure was the worst part of the whole process, and definitely a point at which I was happy to only be looking at 8 files. In the end I came up with something I'm satisfied with, and it centers around an "extra" if statement to detect the subtle structural difference:
if (!isset($reportDetail['record']['row']))
The if and else branches handle pulling the proper data out of the two different structures and outputting it in a common format for further processing; a rough sketch of that branching follows.
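The real field names in the reports are richer than this, but the shape of the problem looks roughly like the sketch below: with a single record the row sits directly under ['record'], while with multiple records ['record'] becomes a numerically indexed list. ($reportDetail mirrors the snippet above; the handleRow() helper is purely illustrative, not the exact code from the repo.)
if (!isset($reportDetail['record']['row'])) {
    // Multiple <record> elements: 'record' is a numerically indexed list of records.
    foreach ($reportDetail['record'] as $record) {
        handleRow($record['row']);
    }
} else {
    // Single <record> element: the row sits directly under 'record'.
    handleRow($reportDetail['record']['row']);
}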
Spit Out Results
Typically I'd just run this script from the console, so I'm looking for a little data right there. In particular, I wanted to see the count of failures, the source of said failures, and the date/file in which they were reported. Pretty simple with an inline print statement:
print date('Y-m-d h:m:s', $summaryData['end']) . " : $summaryData[count] failures from source IP $summaryData[source_ip] as $summaryData[header_from] (see $summaryData[filename])\n";
This format would spit out the following summary (included in the overall console output):
Total Report Files Scanned: 1
Distinct Reports with 'fail/fail' Records: 1
Total number of failures: 2
2021-11-01 06:11:59 : 2 failures from source IP 234.56.78.90 as example.com (see google.com!example.com!1635724800!1635811199.xml)
Failure analysis dataset created at /path/to/dmarc-policyanalyzer/rawfailuredata.csv
Analysis completed in 0.0214 seconds.
Data for Further Analysis
I have the script write out a simple CSV with the same summary data as output to the console, but with a few additional details in case I want to do some filtering in Excel, Google Sheets, or PowerBI:
org_name,begin_timestamp,end_timestamp,end_datetime,end_year,end_month,end_day,domain_policy,p,sp,source_ip,count,disposition,header_from
google.com,1635724800,1635811199,"2021-11-01 06:11:59",2021,11,01,example.com,reject,quarantine,234.56.78.90,2,reject,example.com
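Writing the file is about as simple as CSV output gets in PHP; a minimal sketch, assuming a $failureRows array has already been built up while walking the reports (the variable name is illustrative; the filename matches the console output above):
$handle = fopen(__DIR__ . '/rawfailuredata.csv', 'w');

// Header row matching the columns shown above.
fputcsv($handle, ['org_name', 'begin_timestamp', 'end_timestamp', 'end_datetime', 'end_year',
    'end_month', 'end_day', 'domain_policy', 'p', 'sp', 'source_ip', 'count', 'disposition', 'header_from']);

// One line per fail/fail record collected during the walk.
foreach ($failureRows as $row) {
    fputcsv($handle, $row);
}

fclose($handle);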
It Works Great!
For grins, I added a quick process-timing check just to see how long it'd take to rifle through several hundred files. Typically, it comes out to about a second per 100 files.
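The timing is just a pair of microtime() calls wrapped around the main work; something like this sketch (the calls in the middle stand in for the actual steps):
$start = microtime(true);

recursiveGlobXMLToArray(__DIR__ . '/reports');
// ... walk the array, print the summary lines, write the CSV ...

printf("Analysis completed in %.4f seconds.\n", microtime(true) - $start);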
I noticed that my 'gut' feeling from the last post is pretty accurate. Outright failures across the board happen less frequently than a year ago, and when they do happen they tend to cluster in significantly smaller numbers. Most failures now are one per source address; one failure report back in March (around the time I started moving policies to quarantine) peaked at 50 failures for a single source IP.
I've put the script, along with some example files, in a GitHub repo. Feel free to use it, modify it, or take inspiration from it for similar analysis!
What's Next?
Honestly, it's unclear. Sometime over the winter I'll further crank down the trailing couple of domains still using a quarantine policy (after re-evaluating them with this script, of course). Beyond that, though, I'm not sure. I have a separate and distinct email account collecting the reports, so rather than parse them out daily I might change that cadence (or even pull some domains from reporting entirely). Time will tell, but the entire process has been quite a learning opportunity, and it has definitely helped reduce some spam along the way. Most importantly, though, I'm restricting the use of these domains to my own needs.